Consider the following article: https://arxiv.org/pdf/2106.13823.pdf

It is an academic paper formatted in two columns.

I want to extract text from a two-column PDF in the natural reading order: first down the left column, then down the right. Does pypdf / PyPDF2 do this by default?

There are 3 best solutions below

0
Luuk On

Text extracted by pypdf from the end of page 5 and the start of page 6:

We apply Π to project ρ⊗N
tonto the subspace
where the total codeword length ltotal∈[NS(ρ,σ)−
ǫ,NS(ρ,σ)+ǫ]:
5
γ=Πρ⊗N
tΠ
tr(Πρ⊗N
t). (29)
We calculate the quantum fidelity to show that our data
compression is indeed faithful:

Conclusion: this document is not suitable for conversion to plain text, because it contains formulas that cannot be converted to text correctly.

Code used (based on: https://stackoverflow.com/a/63518022/724039 ):

from pypdf import PdfReader

# Read every page and join the extracted text, one page after another.
reader = PdfReader("2106.13823.pdf")
text = "".join(page.extract_text() + "\n" for page in reader.pages)

# Write as UTF-8 so the mathematical symbols survive.
with open("2106.13823.txt", "w", encoding="utf-8") as f:
    f.write(text)

Opening the PDF in LibreOffice Writer (version 24.2.0.3) gives better results. [Screenshot: LibreOffice Writer, end of page 4 and start of page 5]

0
K J On

A graphical document such as a typeset academic paper in PDF form is best edited in a graphical application such as MS Word, so simply import the PDF there and edit the two columns and the graphics directly.

For simpler text use a command line approach.

For translations, Google's online translator can convert the PDF body text but not the mathematical equations, so you would need to mix and match with Office "Draw" graphics.

However, whatever application you use, expect to make extensive alterations to the maths: several adjustments were needed in just one small area to reverse the conversion from LaTeX to PDF.

0
Martin Thoma On

Yes, pypdf can extract text in natural reading order. However, this works mainly because that is typically also the order in which the text appears in the document definition, which is simply a result of how the PDF was generated.
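
If a PDF stores its text in some other order, one possible fallback is to collect text fragments together with their positions (for example through the visitor_text callback of pypdf's extract_text) and reorder them yourself. The following is only a minimal sketch of that reordering step, not something pypdf does internally. The names Fragment and column_order are hypothetical, and it assumes equal-width columns, PDF-style coordinates (y grows upward), and no elements that span both columns, such as a title.

```python
from typing import Iterable, List, Tuple

# Hypothetical fragment type: (x, y, text) in PDF coordinates,
# where y grows upward, so larger y means higher on the page.
Fragment = Tuple[float, float, str]


def column_order(
    fragments: Iterable[Fragment],
    page_width: float,
    n_columns: int = 2,
) -> List[str]:
    """Return the fragment texts column by column, top to bottom.

    Assumption: columns have equal width and no fragment spans
    more than one column.
    """
    col_width = page_width / n_columns

    def sort_key(frag: Fragment) -> Tuple[int, float, float]:
        x, y, _ = frag
        # Clamp so fragments slightly outside the page still get a column.
        col = max(0, min(int(x // col_width), n_columns - 1))
        # Negate y so that higher fragments sort first within a column.
        return (col, -y, x)

    return [text for _, _, text in sorted(fragments, key=sort_key)]


if __name__ == "__main__":
    # Illustrative data only: a US Letter page is 612 points wide.
    sample = [
        (320.0, 700.0, "right, line 1"),
        (50.0, 700.0, "left, line 1"),
        (50.0, 680.0, "left, line 2"),
        (320.0, 680.0, "right, line 2"),
    ]
    for line in column_order(sample, page_width=612.0):
        print(line)
```

Sorting on (column, -y, x) keeps the sketch short, but real pages usually need extra handling for full-width titles, footnotes, and figures.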