.pdf to .txt and back again (encoding issues)


I'm trying to use the refextract Python library to extract references from a bunch of reports. These reports come in both .docx and .pdf formats.

I've noticed that I get better performance from refextract when I convert the .pdf files to .txt and then back to .pdf. I suspect this is because the .pdf files were originally produced from Word in a variety of ways, and the round trip strips out a tonne of "junk". However, I'm running into encoding and decoding issues.

I'm creating the text files like this:

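import fitz  # PyMuPDF
import glob
import os

# `path` is the folder holding the reports; it is defined elsewhere.
pdf_files = glob.glob(os.path.join(path, '*.pdf'))
for pdf_file in pdf_files:
    txt_file = os.path.splitext(pdf_file)[0] + '.txt'
    with open(txt_file, 'w', encoding='utf-8') as outfile:
        doc = fitz.open(pdf_file)
        for page in doc:
            outfile.write(page.get_text())
        doc.close()
    os.remove(pdf_file)  # the original PDF is deleted once its text is saved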

However, when I try to convert this back into a .pdf (using the following code):

from fpdf import FPDF
import glob
import os

# `path` is the folder holding the reports; it is defined elsewhere.
txt_files = glob.glob(os.path.join(path, '*.txt'))
for txt_file in txt_files:
    print(txt_file)
    with open(txt_file, 'r', encoding='utf-8') as infile:
        doc = infile.read()
    pdf = FPDF()
    pdf.add_page()
    pdf.set_font('Arial', size=12)
    pdf.write(5, doc)
    pdf.output(os.path.splitext(txt_file)[0] + '.pdf')
    os.remove(txt_file)

I get the following error: "UnicodeEncodeError: 'latin-1' codec can't encode characters in position 120-121: ordinal not in range(256)"
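To see which characters trigger this, a small standard-library sketch can list every character in the extracted text whose code point is above 255, which is the upper limit of Latin-1. The function name `non_latin1_chars` and the sample string are hypothetical and only there for illustration; in practice the input would be the contents of one of the .txt files.

```python
import unicodedata


def non_latin1_chars(text):
    """Return (index, char, Unicode name) for each char Latin-1 cannot encode."""
    return [
        (i, ch, unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(text)
        if ord(ch) > 255  # Latin-1 only covers code points 0..255
    ]


# Hypothetical sample containing an en dash and curly quotes.
sample = "Smith \u2013 \u201cTitle\u201d, 2019"
for index, char, name in non_latin1_chars(sample):
    print(index, repr(char), name)
```

Running this over a failing file should show what sits at the positions named in the error message (120-121 in my case).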

I've tried messing with the encoding by changing it to 'latin-1' using the following code after I've read the text file in:

doc = doc.encode('latin-1')

When I do this I end up with the following error: "UnicodeEncodeError: 'latin-1' codec can't encode character '\u20ac' in position 44: ordinal not in range(256)". The original PDF for this file contains no euro symbols, so the extraction must be producing something other than what appears on the page.
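A plain `doc.encode('latin-1')` raises as soon as it meets a character outside Latin-1, and it also returns `bytes` rather than a `str`. A possible, lossy workaround is sketched below: map a few common typographic characters to ASCII look-alikes, then encode with `errors="replace"` and decode straight back so the result is still a string. The helper name `to_latin1_safe` and the contents of `REPLACEMENTS` are assumptions made for illustration, not anything taken from refextract or fpdf.

```python
# Hypothetical mapping of common typographic characters to ASCII look-alikes.
REPLACEMENTS = {
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2013": "-", "\u2014": "-",   # en dash, em dash
    "\u2026": "...",                # ellipsis
    "\u20ac": "EUR",                # euro sign
}


def to_latin1_safe(text):
    """Return a str that contains only Latin-1 encodable characters."""
    for src, dst in REPLACEMENTS.items():
        text = text.replace(src, dst)
    # Anything still outside Latin-1 becomes '?'; decoding gives back a str.
    return text.encode("latin-1", errors="replace").decode("latin-1")


print(to_latin1_safe("\u201cCost\u201d \u2013 5\u20ac \u2022 caf\u00e9"))
```

The cleaned string could then be passed to `pdf.write(5, ...)` in place of `doc`. Anything not covered by the mapping is turned into `?`, so this may throw away characters that matter for reference extraction.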

This encoding and decoding is clearly causing me a problem. Any ideas or information would be extremely helpful.
