Has anyone used AWS Textract to add OCR text to PDFs in Python?


I'm exploring options for semi-automated redaction of PDFs using various NLP techniques, and have been using PyMuPDF with Tesseract via ocrmypdf for OCR. This works pretty well overall, but management want to try Textract as an alternative. It's easy enough to call it against a single page of a PDF and read the resulting dictionary, but I haven't yet found a simple way to map that back into the PDF as invisible text to create a searchable version of the page (all of which ocrmypdf does automatically).

For reference, here's an example of the dict that Textract produces. A given entry can be either a WORD or LINE.

             'Id': 'be018daa-02c9-47d2-903a-73b69bdaa181',
             'Text': "owners'",
             'TextType': 'PRINTED'},
            {'BlockType': 'WORD',
             'Confidence': 95.73345947265625,
             'Geometry': {'BoundingBox': {'Height': 0.014128071255981922,
                                          'Left': 0.7538964748382568,
                                          'Top': 0.7295616269111633,
                                          'Width': 0.08705723285675049},
                          'Polygon': [{'X': 0.7539187669754028,
                                       'Y': 0.7295616269111633},
                                      {'X': 0.8409537076950073,
                                       'Y': 0.7295762896537781},
                                      {'X': 0.8409309983253479,
                                       'Y': 0.7436897158622742},
                                      {'X': 0.7538964748382568,
                                       'Y': 0.7436745166778564}]},

Has anyone done this in Python, or have suggestions?

I'm working through various options. One mechanism I was thinking of was using the polygon coordinates provided for each LINE or WORD to create a new PyMuPDF Rect, then calling insertTextbox() (named insert_textbox() in current PyMuPDF versions) against that rectangle.

But then there's the problem of font size/face and making sure it all aligns, which means identifying what font was detected and its size.

We also have the problem that our PDFs come from a variety of uncontrolled sources, and can variously contain 100% searchable, 100% image-only, or a mix of page types. And they can be produced by a whole range of applications, so there's no single option that will likely cover everything.

Jorj McKie

I have done that many times using PyMuPDF. There are a few things to watch out for:

  1. Textract recognizes no fonts, so you have to decide which font to use for your insertions.
  2. Textract delivers bboxes of lines and words, but no font size. You have to compute the font size that makes the text fit the (recomputed) bbox on output.
  3. Textract coordinates are all between 0 and 1. You need your original page dimensions to transform Textract coordinates to output coordinates.

Once you have solutions for the above (using PyMuPDF makes it fairly simple), insert the text on your output page using page.insert_text() in PyMuPDF with render_mode=3: this causes the text to be invisible.

For point 3 above, use a PyMuPDF rectangle method: matrix = fitz.Rect(0, 0, 1, 1).torect(page.rect). Then take a Textract bounding box and make a PyMuPDF-compatible rectangle from it, with top-left corner (x0, y0) and bottom-right corner (x1, y1): textract_rect = fitz.Rect(x0, y0, x1, y1). Here x0 = Left, y0 = Top, x1 = Left + Width and y1 = Top + Height. The following then gives you the corresponding bbox on your output page: bbox = textract_rect * matrix.
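To make it clearer what that matrix does, here is the same mapping written as plain arithmetic, without PyMuPDF. It is only a sketch: the function name textract_bbox_to_page and the sample box are made up for illustration, and it assumes an unrotated page. Both Textract and PyMuPDF put the origin at the top-left corner with y increasing downwards, so no flipping is needed.

```python
def textract_bbox_to_page(bounding_box, page_width, page_height):
    """Scale a Textract BoundingBox (fractions of the page) to page units.

    Returns (x0, y0, x1, y1): top-left and bottom-right corners.
    """
    x0 = bounding_box["Left"] * page_width
    y0 = bounding_box["Top"] * page_height
    x1 = (bounding_box["Left"] + bounding_box["Width"]) * page_width
    y1 = (bounding_box["Top"] + bounding_box["Height"]) * page_height
    return (x0, y0, x1, y1)


# Hypothetical box on a 600 x 800 point page (values chosen for illustration).
box = {"Left": 0.5, "Top": 0.25, "Width": 0.25, "Height": 0.5}
print(textract_bbox_to_page(box, 600, 800))
```

The four numbers can be passed straight to fitz.Rect(x0, y0, x1, y1) if you prefer this over the matrix approach.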

I suggest using Helvetica for the output: font = fitz.Font("helv").

If you have your text and its output bbox, compute the font size like this: textlen = font.text_length(text, fontsize=1) gives the length the text would have if the font size were 1. Then bbox.width / textlen should give you a good value for the font size to use.
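The font size calculation can be sketched as a small helper. The names fit_fontsize and fake_measure are hypothetical; the helper takes the measuring function as a parameter so that it can be tried without a font, and the fixed-width stand-in below exists only for demonstration. The guard against zero-width text is my addition, since dividing by a zero text length would otherwise fail.

```python
def fit_fontsize(text, target_width, measure):
    """Return the font size at which `text` spans `target_width`.

    `measure(text)` must return the width of `text` at font size 1.
    """
    unit_width = measure(text)
    if unit_width <= 0:
        raise ValueError("cannot fit empty or zero-width text")
    return target_width / unit_width


# Stand-in for a real font, for demonstration only:
# every character is assumed to be 0.5 units wide at font size 1.
def fake_measure(text):
    return 0.5 * len(text)


print(fit_fontsize("owners", 48.0, fake_measure))
```

With PyMuPDF, measure would be something like lambda t: font.text_length(t, fontsize=1) and target_width would be bbox.width.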

Next problem is the insertion point (needed for page.insert_text()).

bbox.bl (the bottom-left point) is a good start, but if your text contains characters descending below the baseline (e.g. g, y), you need to move the insertion point up a little. Use font.descender and the computed font size to work out by how much.