Exclude page number from text when extracting from a PDF

I want to exclude the page number from the text extracted from a PDF with the pypdf package:

from pypdf import PdfReader

reader = PdfReader("pdf-examples/kurdish-sample-2.pdf")
full_text = ""
for page in reader.pages:
    full_text += page.extract_text() + "\n"
print(full_text)

Output:

5 دوارۆژی ئەم منداڵه بکەنەوە کە چۆن و چی بەسەر دێت و دووچاری 

The number 5 at the start of the output is the page number, which should be excluded.

There is 1 best solution below

simbullar On

You can use the pass statement when the iteration counter reaches 5, like this:

from pypdf import PdfReader

reader = PdfReader("pdf-examples/kurdish-sample-2.pdf")
full_text = ""

def extract_pages(reader, text):
    # i counts the pages, starting from 1
    i = 1
    for page in reader.pages:
        if i == 5:
            # fifth page: do nothing, so its text is not added
            pass
        else:
            text += page.extract_text() + "\n"
        i = i + 1
    return text

full_text = extract_pages(reader, full_text)

Here we use i as an iteration counter and increase it by one every time a page is read. So if i is 5, we are on the fifth page, and we do nothing with it by writing pass. Note that this leaves out the whole text of the fifth page, not only the number 5 printed on it.
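If the goal is to keep every page and only drop the printed page number, one possible approach is to strip the number from the start of each page's extracted text. The following is only a sketch under two assumptions taken from the sample output in the question: the page number is the first token that extract_text() returns for a page, and it is printed in Western digits. The names strip_page_number, extract_without_page_numbers and first_page_number are made up for this example and are not part of pypdf.

```python
import re


def strip_page_number(page_text, page_number):
    """Remove page_number if it is the first whitespace-separated token."""
    # Optional leading whitespace, the number, then whitespace or end of text.
    # The lookahead keeps "52 ..." intact when page_number is 5.
    pattern = r"^\s*" + re.escape(str(page_number)) + r"(?!\S)\s*"
    return re.sub(pattern, "", page_text, count=1)


def extract_without_page_numbers(reader, first_page_number=1):
    """Join the text of all pages, dropping each page's leading page number.

    reader is expected to behave like pypdf.PdfReader (a .pages sequence
    whose items have .extract_text()). first_page_number is an assumption
    about the number printed on the first page of the file.
    """
    parts = []
    for number, page in enumerate(reader.pages, start=first_page_number):
        parts.append(strip_page_number(page.extract_text(), number))
    return "\n".join(parts)
```

With the reader from the question this would be called as full_text = extract_without_page_numbers(reader). If the printed numbering does not start at 1, or the number appears at the end of the page text instead of the start, the pattern and first_page_number need to be adjusted accordingly.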