How to extract text from different pages of pdf and store the output in a text document?

107 Views Asked by Sindhu At 26 September 2023 at 06:34

# importing required modules
import PyPDF2

# creating a pdf file object
pdfFileObj = open('C:\\sem1\\691-project\\Dataset\\Maths\\A Spiral Workbook for Discrete Mathematics.pdf', 'rb')

# creating a pdf reader object
pdfReader = PyPDF2.PdfReader(pdfFileObj)

# printing number of pages in pdf file
print(len(pdfReader.pages))

# creating a page object
pageObj = pdfReader.pages[89]

# extracting text from page
print(pageObj.extract_text())

# closing the pdf file object
pdfFileObj.close()

I can extract text from only one page, but I am unable to extract it from multiple pages.


There are 1 best solutions below

Ibrahim E.Gad On 26 September 2023 at 06:53

You just need to iterate over all the pages of the PDF.
Look at this line:

pageObj = pdfReader.pages[89]

This actually gets page 90 of the document, because pages are zero-indexed: index 0 is page 1, index 1 is page 2, and so on.
Instead, loop through all the pages like this:

for pageObj in pdfReader.pages:
    page_text = pageObj.extract_text()
    print(page_text)
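
If only some pages are needed rather than all of them, the zero-based indexing described above has to be applied to each page number. The following is a minimal sketch of that conversion. The helper names `to_zero_based` and `extract_selected` are hypothetical (they are not part of PyPDF2), and `pages` is assumed to be something like `pdfReader.pages` from the question, i.e. a sequence whose items have `extract_text()`.

```python
def to_zero_based(page_numbers, total_pages):
    """Convert human page numbers (starting at 1) to zero-based indices."""
    indices = []
    for number in page_numbers:
        # Reject page numbers that do not exist in the document.
        if not 1 <= number <= total_pages:
            raise ValueError(f"page {number} is outside 1..{total_pages}")
        indices.append(number - 1)
    return indices


def extract_selected(pages, page_numbers):
    """Return the text of the requested pages, in the order requested.

    `pages` is assumed to support len() and indexing, and each item is
    assumed to provide extract_text(), as pdfReader.pages does.
    """
    indices = to_zero_based(page_numbers, len(pages))
    return [pages[i].extract_text() for i in indices]
```

With the reader from the question, a call such as `extract_selected(pdfReader.pages, [1, 2, 90])` would return the text of pages 1, 2 and 90 as a list of strings.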

To save the text to a file, open the file before the loop and append the text of every page to the end of the file.

# raw string (r'...') so the backslashes in the Windows path are not read as escape sequences
out_file = open(r'C:\sem1\691-project\Dataset\Maths\A Spiral Workbook for Discrete Mathematics.txt', 'a')
for pageObj in pdfReader.pages:
    page_text = pageObj.extract_text()
    print(page_text)
    out_file.write(page_text)
out_file.close()
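
The same loop can also be written with a `with` block, which closes the output file even if extraction fails partway through. This is only a sketch: `save_pages` is a hypothetical helper name, the UTF-8 encoding and the newline written after each page are assumptions added here so that the text of consecutive pages does not run together, and `pages` is assumed to be something like `pdfReader.pages`.

```python
def save_pages(pages, out_path, mode="a"):
    """Write the text of every page to out_path and return the page count.

    mode="a" appends, as in the answer above. Each item of `pages` is
    assumed to provide extract_text().
    """
    count = 0
    with open(out_path, mode, encoding="utf-8") as out_file:
        for page in pages:
            # Assumption: treat a missing result as empty text.
            text = page.extract_text() or ""
            out_file.write(text)
            # Separate pages with a newline (an addition, not in the answer).
            out_file.write("\n")
            count += 1
    return count
```

With the reader from the answer, this could be called as `save_pages(pdfReader.pages, out_path)`, where `out_path` is the path of the text file to write.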

Your whole code may look like this:

import PyPDF2
# raw strings (r'...') keep the backslashes in the Windows paths literal
pdfFileObj = open(r'C:\sem1\691-project\Dataset\Maths\A Spiral Workbook for Discrete Mathematics.pdf', 'rb')
pdfReader = PyPDF2.PdfReader(pdfFileObj)
out_file = open(r'C:\sem1\691-project\Dataset\Maths\A Spiral Workbook for Discrete Mathematics.txt', 'a')
for pageObj in pdfReader.pages:
    page_text = pageObj.extract_text()
    print(page_text)
    out_file.write(page_text)
out_file.close()
pdfFileObj.close()

Please note that this code always appends to the end of the file 'C:\sem1\691-project\Dataset\Maths\A Spiral Workbook for Discrete Mathematics.txt'. If you run it multiple times, be sure to empty the file first.
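
One way to avoid emptying the file by hand is to open it with mode 'w' instead of 'a', which truncates the file every time it is opened. The small sketch below shows the difference using only the standard library. The function names and the temporary file are illustrative assumptions, not part of the original code.

```python
import os
import tempfile


def write_text(path, text, mode):
    """Write text to path using the given open() mode ('a' or 'w')."""
    with open(path, mode, encoding="utf-8") as f:
        f.write(text)


def read_text(path):
    """Return the whole content of the file at path."""
    with open(path, encoding="utf-8") as f:
        return f.read()


def compare_modes():
    """Simulate running the script twice in append mode, then once in write mode."""
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "out.txt")
        # Two runs in append mode: the second run adds to the first.
        write_text(path, "page text\n", "a")
        write_text(path, "page text\n", "a")
        appended = read_text(path)
        # One run in write mode: the earlier content is discarded.
        write_text(path, "page text\n", "w")
        overwritten = read_text(path)
    return appended, overwritten
```

Here `compare_modes()` returns the file content after the two appending runs (the text twice) and after the overwriting run (the text once).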
