PDFTron low-level text extractor glyph issue


Base problem: The PDFTron library's font.MapToUnicode returns a character in the wrong case.

Details: This happens in one particular book; screenshots are attached below. A few characters come out in lowercase even though char.char_code is the code for the uppercase letter. As far as I can tell, the mapping between the font's characters and its glyphs is at fault. Please go through the code and the file and help me with this case.

pdf_file_prob_char

Environment: PDFNetPython3 library version 9.4.2

The original PDF has some capital letters, but we get lowercase letters for them: the PDF text is a capital 'O', but the PDFTron text extractor gives a lowercase 'o'.

Screenshot of the PDF: [image]

Code:

from PDFNetPython3.PDFNetPython import PDFNet, PDFDoc, ElementReader, Element, Point
from PDFNetPython3.PDFNetPython import Font, GState, ColorSpace, PatternColor, PathData

class CharFromPDF:

    def __init__(self):
        pass

    def print_char_from_pdf(self, pdf_file_path):
        PDFNet.Initialize("demo:1691991990538:7c56930b030000000055aed6bf8e4eb6a00bb237070a3797ee21cafe95")
        doc = PDFDoc(pdf_file_path)
        doc.InitSecurityHandler()
        page_begin = doc.GetPageIterator()
        page_reader = ElementReader()

        # Walk every page and process its content stream
        itr = page_begin
        while itr.HasNext():
            page_reader.Begin(itr.Current())
            self.process_elements(page_reader)
            page_reader.End()
            itr.Next()
        doc.Close()
        PDFNet.Terminate()
        print("Done.")

    def process_path(self, reader, path):
        gs = path.GetGState()
        gs_itr = reader.GetChangesIterator()
        while gs_itr.HasNext():
            if gs_itr.Current() == GState.e_fill_color:
                # Tiling patterns have their own content stream that may contain text
                if (gs.GetFillColorSpace().GetType() == ColorSpace.e_pattern and
                        gs.GetFillPattern().GetType() != PatternColor.e_shading):
                    reader.PatternBegin(True)
                    self.process_elements(reader)
                    reader.End()
            gs_itr.Next()
        reader.ClearChangeList()

    def process_text(self, page_reader):
        # Begin text element
        element = page_reader.Next()
        while element is not None:
            element_type = element.GetType()
            if element_type == Element.e_text_end:
                return
            elif element_type == Element.e_text:
                gs = element.GetGState()
                font = gs.GetFont()
                if font.GetType() == Font.e_Type3:
                    # Type 3 font: each glyph is itself a content stream
                    itr = element.GetCharIterator()
                    while itr.HasNext():
                        page_reader.Type3FontBegin(itr.Current())
                        self.process_elements(page_reader)
                        page_reader.End()
                        itr.Next()  # advance, otherwise this loop never ends
                else:
                    itr = element.GetCharIterator()
                    while itr.HasNext():
                        char_code = itr.Current().char_code
                        # Map the raw character code to Unicode text via the font
                        a = font.MapToUnicode(char_code)
                        print("Char: ", a[0], " ascii code: ", ascii(a[0]), "char_code", char_code,
                              " Font Name: ", font.GetName())
                        itr.Next()
                print("")
            element = page_reader.Next()

    def process_elements(self, reader):
        element = reader.Next()
        while element is not None:
            element_type = element.GetType()
            if element_type == Element.e_path:
                self.process_path(reader, element)
            elif element_type == Element.e_text_begin:
                self.process_text(reader)
            elif element_type == Element.e_form:
                # Form XObjects are nested content streams
                reader.FormBegin()
                self.process_elements(reader)
                reader.End()
            element = reader.Next()


if __name__ == "__main__":
    cfp = CharFromPDF()
    input_file_path = "text_issue.pdf"
    cfp.print_char_from_pdf(pdf_file_path=input_file_path)

In the example above, the character code passed to font.MapToUnicode for "o" is the code of the uppercase letter, but the function returns the lowercase letter.
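
One way to confirm the mismatch programmatically is to compare each character code with the text the font mapped it to. The sketch below is only an illustration and does not call PDFNet: it assumes a simple (single-byte) font that uses WinAnsiEncoding, approximates that encoding with Python's cp1252 codec, and uses hypothetical helper names (`winansi_char`, `case_mismatches`). In the real script, the pairs would be collected from `itr.Current().char_code` and `font.MapToUnicode(char_code)`.

```python
def winansi_char(char_code):
    """Decode a one-byte code as WinAnsiEncoding would (approximated by cp1252)."""
    try:
        return bytes([char_code]).decode("cp1252")
    except ValueError:
        # Code is outside 0-255, or the byte is undefined in cp1252
        # (UnicodeDecodeError is a subclass of ValueError).
        return None


def case_mismatches(pairs):
    """Return (char_code, mapped, expected) tuples where only the letter case differs."""
    found = []
    for char_code, mapped in pairs:
        expected = winansi_char(char_code)
        if expected is None or not mapped:
            continue
        if mapped != expected and mapped.lower() == expected.lower():
            found.append((char_code, mapped, expected))
    return found


# Hypothetical sample data: code 79 is 'O' in WinAnsi, but the font mapped it to 'o'.
sample = [(79, "o"), (80, "P"), (101, "e")]
print(case_mismatches(sample))
```

Any tuple this reports is a code whose mapped text differs from the encoding's character only in case, which is the symptom described above.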

We also tried TextExtractor from the same library, but its text output is likewise the other way around (vice versa).

There are 1 best solutions below

iPDFdev On

The font used to display the text has a ToUnicode cmap that maps 'O' (upper case O) to 'o' (lower case o).
The PDF specification says that when extracting text from a PDF, the ToUnicode cmap should be consulted first, and only then the font's encoding.
It seems that Acrobat ignores the ToUnicode cmap in favor of the font's WinAnsi encoding. Even after fixing the cmap's code space range, Acrobat still ignores it, so this might be behavior particular to Acrobat with WinAnsi encoding (and not compliant with the PDF specification).
Other PDF readers such as SumatraPDF use the ToUnicode cmap for text extraction so their output is the same as PDFTron.
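
To illustrate the lookup order, here is a minimal sketch in plain Python (no PDFNet calls). The CMap fragment is hypothetical, since the actual ToUnicode cmap from the PDF in question is not shown here; the helper names are made up for the example, only `bfchar` entries are handled (`bfrange` is ignored), and cp1252 stands in for WinAnsi encoding.

```python
import re

# Hypothetical ToUnicode cmap fragment: code 0x4F ('O' in WinAnsi) is mapped to U+006F ('o').
SAMPLE_CMAP = """
1 begincodespacerange
<00> <FF>
endcodespacerange
2 beginbfchar
<4F> <006F>
<50> <0050>
endbfchar
"""


def parse_bfchar(cmap_text):
    """Collect code -> text pairs from the bfchar sections of a ToUnicode cmap."""
    mapping = {}
    for section in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", section):
            # Destination values are UTF-16BE code units.
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping


def to_text(char_code, to_unicode):
    """Prefer the ToUnicode entry; otherwise fall back to the encoding (cp1252 here)."""
    if char_code in to_unicode:
        return to_unicode[char_code]
    return bytes([char_code]).decode("cp1252")


to_unicode = parse_bfchar(SAMPLE_CMAP)
print(to_text(0x4F, to_unicode))  # ToUnicode entry is used
print(to_text(0x4F, {}))          # no ToUnicode entry, so the encoding is used
```

With the sample fragment, the first call yields the lowercase letter from the cmap and the second yields the uppercase letter from the encoding, which mirrors the difference between the PDFTron/SumatraPDF output and what Acrobat appears to do.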