Python difference between 0 and 0̸


I have a Python 3 script that extracts text from PDFs using the PyPDF2 module. I am searching for numerical values, for example 534,000.00. However, when I extract the text from the PDF and put it into a string, the number looks like 534,000.00, with the three zeros following the comma as regular zeros (0) but the two after the decimal point as 0̸0̸.

Am I missing something here?

When I copied 534,000.00 from the PDF into this form, it looked like 534,OOO.00. I'm not sure what is going on.

Sample code for a single PDF:

import os
import re

import PyPDF2

# file_path is assumed to be defined earlier and to end with a path separator
for file in os.listdir(file_path):
    if file[-7:] == "303.PDF":
        with open(file_path + file, 'rb') as pdfobj:
            pdfReader = PyPDF2.PdfFileReader(pdfobj, strict=False)
            num_pages = pdfReader.numPages

            # collect the text of every page into one string
            text = ""
            count = 0
            while count < num_pages:
                pageobj = pdfReader.getPage(count)
                text += pageobj.extractText()
                count += 1

            # prints nothing
            if re.search(r'534,000\.00', text):
                print("found it")

            # finds it correctly
            if re.search(r'534,OOO\.00', text):
                print("found it")
Answer by Niklas Mertsch:

The OOO in your string is three capital letters O (as in Object), not the digit zero (0).
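One way to confirm which characters the extracted string really contains is to print each character's code point and Unicode name. The following is only a minimal sketch: the helper `describe_chars` and the sample string are hypothetical and stand in for the text returned by the PDF extraction.

```python
import unicodedata


def describe_chars(s):
    """Return (character, code point, Unicode name) for each character in s."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in s
    ]


# Hypothetical sample; in practice, pass a slice of the extracted text.
sample = "534,OOO.00"
for ch, code_point, name in describe_chars(sample):
    print(ch, code_point, name)
```

Characters that look alike on screen (the digit 0, the letter O, or a zero followed by a combining mark) show up with different code points and names in this listing.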

Don't ask me why someone would use letters instead of digits on purpose, or whether some text-recognition program decided these had to be letters.

You could use the character class [0O] in all your regexes to match both.
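As a rough sketch of that idea, a small helper can build the pattern from the amount you are looking for, replacing every zero with `[0O]` and escaping everything else (so the decimal point matches only a literal dot). The function name `amount_pattern` and the sample strings are assumptions made for illustration.

```python
import re


def amount_pattern(amount):
    """Compile a regex for amount that accepts 0 or O wherever a zero is expected."""
    parts = []
    for ch in amount:
        if ch == "0":
            parts.append("[0O]")          # digit zero or capital letter O
        else:
            parts.append(re.escape(ch))   # match every other character literally
    return re.compile("".join(parts))


pattern = amount_pattern("534,000.00")

# Hypothetical extracted strings
print(bool(pattern.search("Total: 534,OOO.00")))
print(bool(pattern.search("Total: 534,000.00")))
```

This only covers the letter O standing in for a zero. If the extracted text contains other look-alike characters, the character class would need to be extended accordingly.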