Reading text from a PDF returns text in an unknown encoding


I'm using PyPDF4 to read text from a PDF I downloaded. This works, but the text string is not readable:

ÓŒŁ–Ł@`@䎖Ł@`@Ä›¥–Ž¢–@¥ŒŒŽ—–fi–Ł
Áfi⁄–fl–Ł–@›ŁƒŒŽfl†£›–

As far as I know, the file is not encrypted: I can open it in Acrobat Reader without any problem. In Reader I can also select, copy, and paste the text correctly.

For reference, this is the code:

import glob
import os

import PyPDF4

# Folder containing the PDFs to process.
relevant_path = 'C:\\_Personal\\Mega\\PycharmProjects\\PDFHandler\\docs\\input\\'

if __name__ == '__main__':

    # recursive=True has no effect without a '**' component, so it is omitted.
    for PDFFile in glob.iglob(relevant_path + '*.pdf'):

        print('Processing File: ' + os.path.basename(PDFFile))
        pdfReader = PyPDF4.PdfFileReader(PDFFile)
        num_pages = pdfReader.numPages

        print(num_pages)

        # Concatenate the extracted text of every page.
        text = ''
        for page_count in range(num_pages):
            pageObj = pdfReader.getPage(page_count)
            text += pageObj.extractText()

        print(text)

Any hints? Are there other packages I could use? ...
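One possible diagnostic, offered as a guess rather than a confirmed cause: the `@` between the words in the sample is byte 0x40, which is the space character in EBCDIC, so the PDF's font may use EBCDIC-like character codes. The sketch below assumes the extractor turned each raw byte into a character via PDFDocEncoding, maps each character back to its byte value, and decodes that byte as EBCDIC. The names `PDFDOC_HIGH`, `CHAR_TO_BYTE`, and `redecode_as_ebcdic` are hypothetical, and the choice of code page `cp037` is also an assumption (`cp500` is a close variant).

```python
# Assumption: each extracted character is a raw byte that was decoded with
# PDFDocEncoding. These are the PDFDocEncoding characters for bytes 0x80..0x9E;
# for all other bytes the sketch treats the encoding as Latin-1, which is a
# simplification.
PDFDOC_HIGH = (
    "\u2022\u2020\u2021\u2026\u2014\u2013\u0192\u2044\u2039\u203a"
    "\u2212\u2030\u201e\u201c\u201d\u2018\u2019\u201a\u2122\ufb01"
    "\ufb02\u0141\u0152\u0160\u0178\u017d\u0131\u0142\u0153\u0161\u017e"
)
CHAR_TO_BYTE = {ch: 0x80 + i for i, ch in enumerate(PDFDOC_HIGH)}


def redecode_as_ebcdic(text, codec="cp037"):
    """Map each character back to a byte and decode that byte as EBCDIC."""
    out = []
    for ch in text:
        if ord(ch) < 0x20:
            # Keep newlines and other control characters untouched.
            out.append(ch)
            continue
        byte = CHAR_TO_BYTE.get(ch, ord(ch) if ord(ch) < 0x100 else None)
        # Characters with no known byte value are passed through unchanged.
        out.append(ch if byte is None else bytes([byte]).decode(codec))
    return "".join(out)


# First word of the second sample line, written with escapes to avoid
# copy/paste ambiguity (for example ligature characters).
sample = "\u00c1\ufb01\u2044\u2013\ufb02\u2013\u0141\u2013"
print(redecode_as_ebcdic(sample))
```

If the output is readable, the same function could be applied to the `text` string built in the loop. If it is not, the font probably uses a custom mapping, and a library that reads the PDF's ToUnicode tables may be the better route.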
