.pdf to .txt and back again (encoding issues)

111 Views Asked by MaxJay At 27 January 2024 at 17:24

I'm trying to use the refextract Python library to extract references from a bunch of reports. These reports come in both .docx and .pdf formats.

I've noticed that refextract gives better results when I convert the .pdf files to .txt and then back to .pdf. I suspect this is because the .pdf files were originally exported from Word in a variety of ways, and the round trip strips out a tonne of "junk". However, I'm running into encoding and decoding issues.

I'm creating the text files like this:

import fitz
import glob
import os

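# path is the folder that holds the reports (defined elsewhere)
pdf_files = glob.glob(os.path.join(path, '*.pdf'))
for pdf_file in pdf_files:
    txt_file = os.path.splitext(pdf_file)[0] + '.txt'
    with open(txt_file, 'w', encoding='utf-8') as outfile:
        doc = fitz.open(pdf_file)
        # write the extracted text of every page into a single .txt file
        for page in doc:
            outfile.write(page.get_text())
        doc.close()
    # the original .pdf is deleted once its text has been written
    os.remove(pdf_file)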

However, when I try to convert this back into a .pdf (using the following code):

from fpdf import FPDF
import glob
import os

# path is the folder that holds the extracted .txt files (defined elsewhere)
txt_files = glob.glob(os.path.join(path, '*.txt'))
for txt_file in txt_files:
    print(txt_file)
    # read the whole text file back in as UTF-8
    with open(txt_file, 'r', encoding='utf-8') as infile:
        doc = infile.read()
    pdf = FPDF()
    pdf.add_page()
    pdf.set_font('Arial', size=12)
    pdf.write(5, doc)
    pdf.output(os.path.splitext(txt_file)[0] + '.pdf')
    os.remove(txt_file)

I get the following error: "UnicodeEncodeError: 'latin-1' codec can't encode characters in position 120-121: ordinal not in range(256)"

I've tried messing with the encoding by changing it to 'latin-1' using the following code after I've read the text file in:

doc = doc.encode('latin-1')

When I do this, I end up with the following error: "UnicodeEncodeError: 'latin-1' codec can't encode character '\u20ac' in position 44: ordinal not in range(256)". The original pdf for this file has no euro symbols, so I must be extracting something else.
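To see exactly what is being extracted, it may help to list every character in the text that latin-1 cannot represent. The sketch below uses only the standard library. The function name find_non_latin1 and the sample string are made up for illustration and are not part of fitz or fpdf. Note that '\u20ac' is the Unicode code point of the euro sign, so the extracted text does contain that character at position 44, even if nothing that looks like a euro sign is visible in the pdf.

```python
import unicodedata


def find_non_latin1(text):
    """Return (position, character, Unicode name) for each character latin-1 cannot encode."""
    problems = []
    for pos, ch in enumerate(text):
        # latin-1 covers code points 0..255 and nothing else
        if ord(ch) > 255:
            problems.append((pos, ch, unicodedata.name(ch, 'UNKNOWN')))
    return problems


# Hypothetical sample; in practice pass in the contents of one extracted .txt file
sample = 'Cost: \u20ac5 \u2013 see \u201cnotes\u201d'
for pos, ch, name in find_non_latin1(sample):
    print(pos, repr(ch), name)
```

Running this over one of the failing .txt files should show whether the offenders are things like curly quotes, dashes, or ligatures, which are common in text exported from Word.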

This encoding and decoding is clearly causing me a problem. Any ideas or information would be extremely helpful.
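One possible workaround, sketched here under assumptions, is to map the text into the latin-1 range before handing it to FPDF, since the error message shows the text is being encoded as latin-1. The REPLACEMENTS table and the to_latin1_safe name are hypothetical and would need adjusting to whatever characters actually appear in the reports. Anything still outside latin-1 after the substitutions is replaced with '?', so this approach is lossy.

```python
import unicodedata

# Hypothetical substitutions for characters often found in Word-derived text
REPLACEMENTS = {
    '\u2018': "'", '\u2019': "'",   # curly single quotes
    '\u201c': '"', '\u201d': '"',   # curly double quotes
    '\u2013': '-', '\u2014': '-',   # en and em dashes
    '\u20ac': 'EUR',                # euro sign
}
_TABLE = str.maketrans(REPLACEMENTS)


def to_latin1_safe(text):
    """Return a copy of text that is guaranteed to encode as latin-1."""
    text = text.translate(_TABLE)
    # NFKC folds compatibility characters, e.g. the 'fi' ligature becomes 'fi'
    text = unicodedata.normalize('NFKC', text)
    # Replace anything still unencodable with '?', then return a str again
    return text.encode('latin-1', errors='replace').decode('latin-1')


print(to_latin1_safe('\u201cref\u201d \u2013 \u20ac5'))
```

In the conversion loop, the idea would be to call pdf.write(5, to_latin1_safe(doc)) instead of pdf.write(5, doc). If the non-latin-1 characters need to survive intact, the alternative would be to register a Unicode TrueType font with FPDF's add_font rather than using the built-in 'Arial', which requires a font file to be available.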
