How to extract text from different pages of a PDF and store the output in a text document?

# importing required modules
import PyPDF2

# creating a pdf file object
pdfFileObj = open('C:\\sem1\\691-project\\Dataset\\Maths\\A Spiral Workbook for Discrete Mathematics.pdf', 'rb')

# creating a pdf reader object
pdfReader = PyPDF2.PdfReader(pdfFileObj)

# printing number of pages in pdf file
print(len(pdfReader.pages))

# creating a page object
pageObj = pdfReader.pages[89]

# extracting text from page
print(pageObj.extract_text())

# closing the pdf file object
pdfFileObj.close()

I can extract text from a single page, but I am unable to extract text from multiple pages.

Answer by Ibrahim E.Gad:

You just need to iterate over all the pages of the PDF.
Look at this line:

pageObj = pdfReader.pages[89]

What you're actually doing is getting page 90 of the document, because pages are zero-indexed (page 1 is at index 0, page 2 is at index 1, and so on).
Instead, loop through all the pages like this:

for pageObj in pdfReader.pages:
    page_text = pageObj.extract_text()
    print(page_text)
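
If you only want some of the pages rather than all of them, the same zero-indexing applies: subtract 1 from each page number before indexing into pdfReader.pages. A minimal sketch of that conversion follows; the helper name pages_to_indexes is hypothetical (it is not part of PyPDF2), and it simply skips page numbers that fall outside the document.

```python
def pages_to_indexes(page_numbers, page_count):
    """Convert 1-based page numbers to zero-based indexes.

    Page numbers outside 1..page_count are skipped.
    This is a hypothetical helper, not a PyPDF2 function.
    """
    return [n - 1 for n in page_numbers if 1 <= n <= page_count]


# Possible usage with the reader from the question (assumes pdfReader exists):
# for index in pages_to_indexes([1, 2, 90], len(pdfReader.pages)):
#     print(pdfReader.pages[index].extract_text())
```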

To save the text to a file, open the file before the loop and append the text of every page to the end of the file:

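# raw string (r'...') so the backslashes in the Windows path are not read as escape sequences
# encoding='utf-8' so symbols outside the default Windows code page can be written
out_file = open(r'C:\sem1\691-project\Dataset\Maths\A Spiral Workbook for Discrete Mathematics.txt', 'a', encoding='utf-8')
for pageObj in pdfReader.pages:
    page_text = pageObj.extract_text()
    print(page_text)
    # append this page's text to the output file
    out_file.write(page_text)
out_file.close()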

Your whole code may look like this:

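import PyPDF2

# raw strings (r'...') so the backslashes in the Windows paths are not read as escape sequences
pdfFileObj = open(r'C:\sem1\691-project\Dataset\Maths\A Spiral Workbook for Discrete Mathematics.pdf', 'rb')
pdfReader = PyPDF2.PdfReader(pdfFileObj)

# open the output file once, before the loop, in append mode
out_file = open(r'C:\sem1\691-project\Dataset\Maths\A Spiral Workbook for Discrete Mathematics.txt', 'a', encoding='utf-8')
for pageObj in pdfReader.pages:
    page_text = pageObj.extract_text()
    print(page_text)
    # append this page's text to the output file
    out_file.write(page_text)

# close both files when done
out_file.close()
pdfFileObj.close()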

Please note that this code always appends to the end of the file 'C:\sem1\691-project\Dataset\Maths\A Spiral Workbook for Discrete Mathematics.txt'. If you run it multiple times, be sure to empty the file first, or open it with mode 'w' instead of 'a' so that it is overwritten on each run.
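
As a variation on the code above, here is a rough sketch that uses with blocks so both files are closed automatically, and mode 'w' so that rerunning it does not duplicate the text. The function names write_pages_to_text and extract_pdf_to_text are made up for this example, and writing a newline after each page is my own choice to keep pages from running together.

```python
def write_pages_to_text(pages, txt_path):
    """Write the text of each page to txt_path; return the number of pages written.

    `pages` is any iterable of objects with an extract_text() method,
    such as pdfReader.pages.
    """
    count = 0
    # Mode 'w' overwrites the file on every run, so reruns do not duplicate text.
    with open(txt_path, "w", encoding="utf-8") as out_file:
        for page in pages:
            # Fall back to an empty string if a page yields no text.
            text = page.extract_text() or ""
            out_file.write(text)
            out_file.write("\n")  # separate pages with a newline
            count += 1
    return count


def extract_pdf_to_text(pdf_path, txt_path):
    """Hypothetical wrapper: read pdf_path with PyPDF2 and write its text to txt_path."""
    import PyPDF2  # imported here so the helper above works without PyPDF2

    with open(pdf_path, "rb") as pdf_file:
        reader = PyPDF2.PdfReader(pdf_file)
        return write_pages_to_text(reader.pages, txt_path)
```

You would call it as extract_pdf_to_text(r'C:\path\to\input.pdf', r'C:\path\to\output.txt'), with the paths written as raw strings and replaced by your own.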