Extracting text and comments from a Google Doc in Python


I need help extracting the comments from one of my Google Docs. I want to get both the text that was commented on and the content of the comment box. For example, if I commented "This is out of place" on the sentence "Hello World", I would like to retrieve both strings. If it is not possible to get both, the content of the comment box matters more. The code I have so far is this:

def read_comments(comments):
    comment_text = ''
    for comment in comments:
        comment_text += comment['content']
    return comment_text

def main():
    credentials = get_credentials()
    http = credentials.authorize(Http())
    docs_service = discovery.build(
        'docs', 'v1', http=http, discoveryServiceUrl=DISCOVERY_DOC)
    
    doc = docs_service.documents().get(documentId=DOCUMENT_ID_2).execute()
    doc_content = doc.get('body').get('content')

    comments = docs_service.documents().get(documentId=DOCUMENT_ID_2).execute().get('comments', [])
    comments_text = read_comments(comments)

    print(comments_text)

    sentences = sent_tokenize(comments_text)
    for sentence in sentences:
        sentence = "{This is a PB}" + sentence + "{This is a PB}"
        print(sentence)

if __name__ == '__main__':
    main()

When I run this I get no error, but nothing is returned: the list of comments is empty.

Answer by msamsami:

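You need to use the Google Drive API, not the Google Docs API, to fetch the comments on a Google Docs file. Comments are not part of the document's content; they are metadata attached to the file. The document returned by the Docs API has no 'comments' key, so .get('comments', []) in your code always falls back to the empty list. Here is a modified script that uses the Drive API to fetch each comment's content and the quoted file content: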

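# Assumed imports (google-api-python-client and httplib2), matching the names used in the question.
from googleapiclient import discovery
from httplib2 import Http

def main():
    credentials = get_credentials()
    http = credentials.authorize(Http())
    # Build a Drive v3 client. The question's DISCOVERY_DOC is not passed here,
    # since it presumably points at the Docs API rather than Drive.
    drive_service = discovery.build("drive", "v3", http=http)

    # Comments belong to the Drive file, so the document ID is passed as fileId.
    # The Drive v3 comments methods require the fields parameter.
    results = drive_service.comments().list(fileId=DOCUMENT_ID_2, fields="*").execute()
    comments = results.get("comments", [])

    # Now, each item in `comments` is a dictionary, with the following fields:
    # 'content', 'quotedFileContent', 'replies', 'author', 'deleted', 'htmlContent', ...
    # The 'content' field contains the comment text
    # The 'quotedFileContent' field contains the text that was commented on

    comments_text = read_comments(comments)

    # Rest of the code
    ...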

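Note that the Google Drive API must be enabled for your project, and the credentials must be authorized for a Drive scope that allows reading comments (for example, https://www.googleapis.com/auth/drive.readonly). If you authenticate with a service account, the document must also be shared with the service account's email address.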
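
Since the question asks for both the commented-on text and the comment itself, here is a minimal sketch of how the two could be paired once the comments are fetched. The helper names list_all_comments and pair_comments_with_quotes are hypothetical, not part of any library. The sketch assumes a Drive v3 service object built as above, and it follows nextPageToken in case the document has more comments than fit on one page.

```python
def list_all_comments(drive_service, file_id):
    """Collect every comment on a file, following nextPageToken across pages.

    `drive_service` is assumed to be a Drive v3 client from discovery.build.
    """
    comments = []
    page_token = None
    while True:
        response = drive_service.comments().list(
            fileId=file_id,
            # Request only the fields used below, plus the paging token.
            fields="nextPageToken,comments(content,quotedFileContent)",
            pageToken=page_token,
        ).execute()
        comments.extend(response.get("comments", []))
        page_token = response.get("nextPageToken")
        if not page_token:
            return comments


def pair_comments_with_quotes(comments):
    """Return a list of (quoted_text, comment_text) tuples."""
    pairs = []
    for comment in comments:
        # quotedFileContent may be missing when a comment is not anchored to selected text.
        quoted = comment.get("quotedFileContent", {}).get("value", "")
        pairs.append((quoted, comment.get("content", "")))
    return pairs


# Usage with hand-written sample data shaped like a Drive comment resource.
sample = [
    {
        "content": "This is out of place",
        "quotedFileContent": {"mimeType": "text/html", "value": "Hello World"},
    }
]
for quoted, text in pair_comments_with_quotes(sample):
    print(f"{quoted!r} -> {text!r}")  # 'Hello World' -> 'This is out of place'
```

In main() this would replace the single list call, for example pairs = pair_comments_with_quotes(list_all_comments(drive_service, DOCUMENT_ID_2)). Returning tuples instead of one concatenated string keeps each comment attached to the text it refers to.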