Can PyPDF extract text from a two-column PDF in the natural reading order: first down the left column, then down the right?

88 Views Asked by Alex At 24 February 2024 at 10:22

Consider the following article: https://arxiv.org/pdf/2106.13823.pdf

It is an academic paper formatted in two columns.

I want to extract text from a two-column PDF in the natural reading order: first down the left column, then down the right. Does PyPDF/PyPDF2 do this by default?


There are 3 best solutions below

Luuk On 24 February 2024 at 10:46

Extracted output for the end of page 5 and the start of page 6:

We apply Π to project ρ⊗N
tonto the subspace
where the total codeword length ltotal∈[NS(ρ,σ)−
ǫ,NS(ρ,σ)+ǫ]:
5
γ=Πρ⊗N
tΠ
tr(Πρ⊗N
t). (29)
We calculate the quantum ﬁdelity to show that our data
compression is indeed faithful:

Conclusion: This document is not suitable for conversion to plain text, because it contains formulas which cannot be (correctly) converted to text.

Code used (based on: https://stackoverflow.com/a/63518022/724039 ):

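from pypdf import PdfReader

reader = PdfReader("2106.13823.pdf")

# Concatenate the extracted text of every page, separated by newlines.
text = ""
for page in reader.pages:
    text += page.extract_text() + "\n"

# print(text)
# Write the result as UTF-8 so the non-ASCII symbols are preserved.
with open("2106.13823.txt", "w", encoding="utf-8") as f:
    f.write(text)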

Better results can be obtained with LibreOffice Writer (version 24.2.0.3).

K J On 24 February 2024 at 11:10

A graphically laid-out document, such as the PDF output of an academic paper, is best edited in a graphical application such as MS Word, so simply import the PDF there to edit the two columns and the graphics.

For simpler text, use a command-line approach.

For translations, Google's online translator can convert the PDF body text but not the mathematical equations, so you would need to mix and match with Office "Draw" graphics.

However, whatever application you use, expect to have to make extensive alterations to the maths: several adjustments were needed in just one small area to reverse the change from LaTeX to PDF.

Martin Thoma On 25 February 2024 at 12:02

Yes, pypdf can extract text in natural reading order. However, this is mainly the case because that is typically also the order in which the text appears in the document definition. That's just how the PDF is generated.
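
When the order of text in the PDF's content stream does not match the reading order, one possible workaround is to reorder the extracted fragments by their position on the page. The following is only a minimal sketch, not pypdf's default behaviour. It assumes a plain two-column page split at the horizontal midpoint, a page whose left edge is at x = 0, and that `tm[4]` and `tm[5]` of the text matrix passed to the `visitor_text` callback of `extract_text()` are usable as approximate x and y positions. The helper names `order_two_columns` and `extract_page_in_columns` are hypothetical.

```python
from typing import List, Tuple

# (x, y, text) in PDF user-space units; illustrative type, not part of pypdf.
Fragment = Tuple[float, float, str]


def order_two_columns(fragments: List[Fragment], page_width: float) -> str:
    """Return the text with the left column first, then the right column.

    Assumption: a plain two-column layout split at the horizontal midpoint.
    Full-width titles, figures, and footnotes are not handled.
    """
    middle = page_width / 2
    left = [f for f in fragments if f[0] < middle]
    right = [f for f in fragments if f[0] >= middle]
    ordered: List[Fragment] = []
    for column in (left, right):
        # PDF y coordinates grow upward, so sort descending for top-to-bottom;
        # fragments on the same line are ordered left-to-right by x.
        ordered.extend(sorted(column, key=lambda f: (-f[1], f[0])))
    return "\n".join(text for _, _, text in ordered)


def extract_page_in_columns(page) -> str:
    """Collect positioned fragments from a pypdf page and reorder them."""
    fragments: List[Fragment] = []

    def visitor(text, cm, tm, font_dict, font_size):
        # tm is the text matrix; tm[4] and tm[5] are its x and y offsets.
        # Assumption: these are close enough to the fragment's page position.
        if text.strip():
            fragments.append((float(tm[4]), float(tm[5]), text.strip()))

    page.extract_text(visitor_text=visitor)
    return order_two_columns(fragments, float(page.mediabox.width))
```

With a `PdfReader` instance named `reader`, this could be called as `extract_page_in_columns(reader.pages[4])` for page 5. Elements that span both columns, such as titles, figures, and wide equations, would still need separate handling, and the formulas themselves remain as garbled as in any other text extraction.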
