I want to extract the text of a PDF document and run some regular expressions over it to filter for information (a sketch of the intended filtering follows the snippet below).
I am coding in Python 3.7.4 and using fitz (PyMuPDF) to parse the PDF document. The PDF document is written in German. My code looks as follows:
import fitz  # PyMuPDF

doc = fitz.open(pdfpath)
pagecount = doc.pageCount
page = 0
content = ""
while page < pagecount:
    # collect the text of every page into one string
    p = doc.loadPage(page)
    page += 1
    content = content + p.getText()
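For context, this is the kind of filtering I intend to run over content afterwards; the pattern is purely illustrative (a German IBAN-like token), not from my real documents:
import re

# Hypothetical example: search the extracted text for a German IBAN-like
# token (DE followed by 20 digits). The real patterns depend on the documents.
match = re.search(r"\bDE\d{20}\b", content)
if match:
    print(match.group())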
Printing the content, I realized that the first (and important) half of the document comes out as a strange mix of Japanese (?) characters and other symbols, like this: ョ。オウキ・ゥエオョァ@ュ.
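For reference, a small diagnostic sketch (independent of fitz) that dumps the Unicode code points of a garbled sample, to see what the extractor actually returned:
# Diagnostic sketch: print each character of a garbled sample together
# with its Unicode code point.
sample = "ョ。オウキ・ゥエオョァ@ュ"
for ch in sample:
    print(ch, hex(ord(ch)))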
I tried to fix it by re-decoding with different encodings (latin-1 / iso-8859-1), but the text seems to be valid UTF-8 already; for example, this round trip changes nothing:
content = content + p.getText().encode("utf-8").decode("utf-8")
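The kind of round trips I mean looks like the following sketch; the encodings are just candidate guesses (cp1252 is added here as one more guess, it was not in my original attempts):
# Guess: the text was mis-decoded somewhere, so re-encode it under one
# assumption and decode it again as UTF-8; "replace" avoids exceptions.
for enc in ("latin-1", "cp1252"):
    print(content.encode(enc, errors="replace").decode("utf-8", errors="replace"))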
I have also tried to get the text using minecart:
import minecart

with open(pdfpath, 'rb') as file:
    document = minecart.Document(file)
    for page in document.iter_pages():
        for lettering in page.letterings:
            print(lettering)
which results in the same problem.
Using textract, the first half comes back as an empty string:
import textract
text = textract.process(pdfpath)
print(text.decode('utf-8'))
Same thing with PyPDF2:
import PyPDF2

pdfFileObj = open(pdfpath, 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
for index in range(pdfReader.numPages):
    pageObj = pdfReader.getPage(index)
    print(pageObj.extractText())
I don't understand the problem, since it looks like a normal PDF document with normal text. Also, some of my PDF documents don't have this problem at all.
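One more diagnostic that might narrow it down (assuming Document.getPageFontList is available in this PyMuPDF version) is to list the fonts used on each page, to compare a broken document with a working one:
import fitz  # PyMuPDF

doc = fitz.open(pdfpath)
# Assumption: getPageFontList exists in this PyMuPDF version; it returns
# one tuple per font used on the given page.
for page_number in range(doc.pageCount):
    for font in doc.getPageFontList(page_number):
        print(page_number, font)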