Is there any Python package for extracting text nicely from PDFs in RTL languages?


I've worked with the well-known Python packages for PDF files, such as PDFMiner, PyMuPDF, and PyPDF2, among others. But none of them extracts text correctly from PDF files written in right-to-left languages (Persian, Arabic).

For example:

import fitz
doc = fitz.open("*/path/to/file.pdf")
txt = doc.get_page_text(0)  # spelled getPageText in older PyMuPDF releases
print(txt)

It returns something like this:

...

اﯾﻨﺘﺮﻧﺖ و ﮐﺎﻣﭙﯿﻮﺗﺮ ﺑﻪ ﻣﺴﻠﻂ

ﻣﺴﻠﻂ ﻫﺎیزﺑﺎن

...

Sometimes the words come out reversed (first character last) and the word order within a sentence is swapped; sometimes the words are correct. It also does not handle the zero-width non-joiner (نیم‌فاصله), which is commonly used in Persian.

I have tried a lot but got nowhere. Thanks in advance for your help.


There are 3 best solutions below

ParisaN

I had this problem, and I wrote the following code:

import fitz  # PyMuPDF

input_file = "p.pdf"
line_list = []

doc = fitz.Document(input_file)

# Current PyMuPDF uses snake_case names (get_text); older releases
# spelled these pageCount, loadPage and getText.
for page in doc:
    text = page.get_text()  # read one page as plain text
    line_list.append(text.splitlines())  # split every page on \n

for lines in line_list:
    # Reverse every line of the page (the original loop only handled the
    # first three lines and failed on pages with fewer than three).
    for k in range(len(lines)):
        lines[k] = lines[k][::-1]
        print(lines[k])

But this package has two problems: 1) It reverses the words (e.g. "سلام" -> "مالس"), which I worked around in this code by reversing each line. 2) It has problems with documents that mix languages, such as Farsi and English.
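For lines that mix Farsi and English, one possible workaround is to reverse the whole line and then flip each left-to-right run back. This is only a minimal sketch using the standard library, assuming the extractor returned the line in visual (left-to-right display) order. The helper names `is_rtl`, `is_ltr`, and `visual_to_logical` are made up for illustration, and it does not implement the full Unicode bidirectional algorithm (mirrored brackets, nested directions, and similar cases are not handled).

```python
import unicodedata


def is_rtl(ch):
    """True for Hebrew/Arabic-script letters (bidi classes R and AL)."""
    return unicodedata.bidirectional(ch) in ("R", "AL")


def is_ltr(ch):
    """True for left-to-right letters and digits (bidi classes L, EN, AN)."""
    return unicodedata.bidirectional(ch) in ("L", "EN", "AN")


def visual_to_logical(line):
    """Reverse a visually ordered line, then restore each LTR run."""
    chars = list(line[::-1])
    i = 0
    while i < len(chars):
        if not is_ltr(chars[i]):
            i += 1
            continue
        # Extend the run up to the next RTL character; spaces and
        # punctuation between LTR words stay inside the run.
        j = last = i
        while j < len(chars) and not is_rtl(chars[j]):
            if is_ltr(chars[j]):
                last = j
            j += 1
        chars[i:last + 1] = chars[i:last + 1][::-1]
        i = j
    return "".join(chars)


# A line as a naive extractor might return it:
# Persian words reversed, the English phrase intact.
visual = "است"[::-1] + " hello world " + "این"[::-1]
print(visual_to_logical(visual))
```

If the python-bidi package is available, its `get_display` function is probably the more robust choice; the sketch above is just meant to show why a plain `[::-1]` is not enough for mixed text.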

larapsodia

I think the answer is that you can do this, but no current package really handles RTL languages well, so you're going to have to do some mopping up after whichever one you use.

I've had some success extracting Arabic text from (born digital) PDFs using pdfplumber. By "some success" I mean that it was a huge pain in the... neck, and didn't end up being accurate enough for my purposes. The pain part was because the extracted text was backwards and it had inserted a space next to every diacritic. (I imagine you would need to deal with the zero-width non-joiner in Persian either as a pre- or post-processing step.) In Arabic, at least, those were fixable; some code is below.

But the accuracy problem was because I was using a PDF of an Arabic novel that was written in a pretty font where some of the letters are kind of stacked on top of each other. pdfplumber was mostly able to extract what letters were there, but not which order. (Not surprising — this is tough for human students of Arabic as well.) I'm not sure if Persian has the same issue. But if your source is using a plain font you might have better results.

The text in the sample below should read: في رأس النهج تبينت مدقة الباب النحاسية التي استقرت فوق الخد الخشبي الضخم علي هيئة قبضة يد نصف مضمومة. الهدوء

import pdfplumber

file = 'sample_page.pdf'
pdf = pdfplumber.open(file)
page = pdf.pages[0]
text = page.extract_text()
print(text[:110])

output:
دّ لخا قوف ترّ قتسا يتلا ةيّ ساحنلا بابلا ةقّ دم تنيّ بت جهنلا سأر في لاإ مٌ يّ مخ ءودلها .ةمومضم فصن دي ةض

^ This is backwards, and there are spaces next to all the diacritics

# Reverse text with bidi
from bidi import algorithm

text_rev = algorithm.get_display(text)
print(text_rev[:110])

output:
يف رأس النهج تب ّينت مد ّقة الباب النحاس ّية التي استق ّرت فوق اخل ّد 
اخلشب ّي الضخم عىل هيئة قبضة يد نصف مضم

^ Not backwards anymore, but still the diacritic problem

# Strip the most common diacritic; in real use you would need to handle all of them
shadda = chr(0x0651)  # chr() replaces Python 2's unichr()
text_rev_dediac = text_rev.replace(" " + shadda, '')
print(text_rev_dediac[:110])

output:
يف رأس النهج تبينت مدقة الباب النحاسية التي استقرت فوق اخلد 
اخلشبي الضخم عىل هيئة قبضة يد نصف مضمومة. اهلدوء 

^ This is right, except where the stacked letters are in the wrong order (for example, the first word is supposed to be في (fy 'in') but instead it's يف (yf)). You can see that the period (after the word مضمومة) is still in the correct place, though. So this is pretty successful, and might be 100% accurate with an easier font.
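To cover more than the shadda, one option is to match the whole block of Arabic combining marks with a regular expression. This is a rough sketch, assuming (as in the output above, after the bidi step) that the stray space always sits directly before the mark. The function name `fix_diacritic_spaces` and its `keep_marks` flag are hypothetical, and the character range (U+064B to U+065F plus U+0670) is my assumption about which marks matter.

```python
import re

# Arabic combining marks: tanwin, short vowels, shadda, sukun, etc.
# (U+064B..U+065F) plus the superscript alef (U+0670).
_SPACE_BEFORE_MARK = re.compile(" ([\u064B-\u065F\u0670])")


def fix_diacritic_spaces(text, keep_marks=False):
    """Remove the stray space before each Arabic diacritic.

    With keep_marks=False the diacritic is dropped too (like the
    shadda-only replace); with keep_marks=True it is re-attached
    to the preceding letter.
    """
    replacement = r"\1" if keep_marks else ""
    return _SPACE_BEFORE_MARK.sub(replacement, text)


sample = "تب \u0651ينت"  # a word with a space inserted before the shadda
print(fix_diacritic_spaces(sample))                   # diacritic removed
print(fix_diacritic_spaces(sample, keep_marks=True))  # diacritic kept
```

Run this on the text after `get_display`, not on the raw backwards output, since there the space ends up on the other side of the mark.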

Good luck!

Masoud Gheisari

After some tests on different packages, I came to the conclusion that PyPDF2 (currently v3.0.1) works better for RTL languages compared to other packages:

First install the package:

pip install PyPDF2

Then extract the text with the following code:

from PyPDF2 import PdfReader

reader = PdfReader("example.pdf")
for page in reader.pages:
    text = page.extract_text()
    print(text)