How to read the contents of PDF files whose encoding is `none`?


Update: solved, see the comments below.

When I try to read the contents of some PDF files, I get an empty string. I have noticed that this happens with PDF files whose encoding is reported as `none`, while files reported as `base64` work fine. The other suspect is the file size: perhaps PyGithub fails to read large files. Without the file contents, I obviously cannot apply OCR.

This happens when I read entire directories on GitHub and copy them to another cloud storage service. The problem is not tied to any one PDF file in particular.

The alternative to PyGithub is calling the REST API through the requests package; I will try that later.
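A minimal sketch of that REST route, assuming the cause is the size limit that GitHub documents for the contents endpoint: as far as I can tell from that documentation, files between 1 MB and 100 MB come back with an empty `content` and an `encoding` of `none` unless the raw media type is requested. The sketch uses the standard library's `urllib` to stay dependency-free; the same URL and headers can be passed to `requests.get`. The function names and the `owner`, `repo`, `path`, and `token` parameters are placeholders, not part of the original post.

```python
import urllib.parse
import urllib.request

API_ROOT = "https://api.github.com"


def raw_request_parts(owner, repo, path, token):
    """Build the URL and headers that ask the contents endpoint for raw bytes."""
    quoted_path = urllib.parse.quote(path)  # keeps "/" but escapes spaces etc.
    url = f"{API_ROOT}/repos/{owner}/{repo}/contents/{quoted_path}"
    headers = {
        # Raw media type: the response body is the file itself, not JSON.
        "Accept": "application/vnd.github.raw+json",
        "Authorization": f"Bearer {token}",
    }
    return url, headers


def fetch_raw_file(owner, repo, path, token, timeout=30):
    """Download one file as bytes (performs a network request)."""
    url, headers = raw_request_parts(owner, repo, path, token)
    request = urllib.request.Request(url, headers=headers)
    # urlopen raises urllib.error.HTTPError on 4xx/5xx responses.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()
```

Calling `fetch_raw_file(owner, repo, "some/file.pdf", token)` should then return the PDF bytes directly, with no base64 decoding step.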

The PDF file I used is this one, and the result is the same with other PDF files that use languages with special characters.

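from github import Github

# token and repo_name are defined elsewhere (access token and repository name)
github_object = Github(token)
github_user = github_object.get_user()
repo = github_user.get_repo(repo_name)
cont_raw = repo.get_contents("20200910-BETA8-ROTULACION-INTERIOR-BOCETO-final.pdf")
# file size in bytes, length of the returned content string, reported encoding
print(cont_raw.size, len(cont_raw.content), cont_raw.encoding)
# output: 1283429 0 none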
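
Staying within PyGithub, one possible workaround is to fall back to the Git blobs endpoint whenever `encoding` is not `base64`, since that endpoint serves base64 content for blobs up to 100 MB. This is a sketch, not a confirmed fix for this repository: `read_file_bytes` is a made-up helper name, `repo` is assumed to be the object returned by `get_repo` above, and `path` is assumed to point at a file rather than a directory.

```python
import base64


def read_file_bytes(repo, path):
    """Return a file's bytes, falling back to the blob API for large files."""
    item = repo.get_contents(path)
    if item.encoding == "base64" and item.content:
        return base64.b64decode(item.content)
    # Large files come back with empty content, so fetch the blob by its SHA.
    blob = repo.get_git_blob(item.sha)
    if blob.encoding == "base64":
        return base64.b64decode(blob.content)
    # The blob API may also report "utf-8" for text it returns as-is.
    return blob.content.encode("utf-8")
```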

There are 2 best solutions below

johnwhitington On

This PDF file does not contain any text or fonts. What looks like text is just ordinary filled shapes.

So, you have no choice but to rasterize the pages and run OCR on them.

In this particular example, it has nothing to do with the language or "encoding" in use.

Joris Schellekens On

To understand your problem, you need to understand PDF in a bit more detail.

You see, PDF is not a WYSIWYG (what you see is what you get) format. If we were to look at an HTML document, you'd recognize the text of the page, and you'd be able to derive other information such as:

  • These characters belong together in a paragraph
  • These paragraphs make up a column in a row, in a table
  • etc

PDF is more like a programming language. Inside a PDF you'll find a special kind of data structure (called a stream) that represents the contents of a page.

Each of these content streams is essentially a (usually compressed) piece of text made up of drawing instructions, in an operator syntax derived from PostScript (a programming language).

In pseudo-code, you might find things like:

  • go to position 40, 450
  • set the stroke color to black
  • set the font to Helvetica, size 12
  • render the character with ID 12
  • go to position 45, 450
  • etc

Now imagine that instead of using a font, your instructions looked more like this:

  • go to position 40, 450
  • set the stroke color to black
  • stroke the following path: .... (which happens to render 'H')
  • etc

Because there is no Font, and no character ID (abbreviated as cid), there is no way of knowing what the underlying text is. The only thing any reader/parser software would see is "this page contains some vector graphics".
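
The difference between the two instruction lists can be sketched in a few lines of Python. The two sample streams are hand-written illustrations, not extracted from the PDF in question, and `shows_text` is a hypothetical helper that naively splits on whitespace (a real parser would also have to handle string literals and inline images). It only checks whether any text-showing operator (`Tj`, `TJ`, `'`, `"`) occurs.

```python
import zlib

# Operators that paint text using the current font.
TEXT_SHOWING_OPERATORS = {b"Tj", b"TJ", b"'", b'"'}


def shows_text(content_stream: bytes) -> bool:
    """Naively report whether a decompressed content stream paints any text."""
    return any(token in TEXT_SHOWING_OPERATORS for token in content_stream.split())


# Text drawn with a font: select font F1 at 12 pt, move to (40, 450), show "H".
WITH_FONT = b"BT /F1 12 Tf 40 450 Td (H) Tj ET"

# The same letter drawn as three stroked lines: no font, no character codes.
AS_PATHS = b"0 0 0 RG 40 450 m 40 462 l S 46 450 m 46 462 l S 40 456 m 46 456 l S"

# Content streams are usually stored Flate-compressed, so decompress first.
stored = zlib.compress(AS_PATHS)
print(shows_text(WITH_FONT))                # True
print(shows_text(zlib.decompress(stored)))  # False
```

For the second stream a text extractor has nothing to work with, even though both streams would look the same on screen.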

In other words, you have drawings of text, not text.

The best way forward would be to convert your entire PDF to images (perhaps using a tool such as Ghostscript) and then apply OCR to the resulting images.
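
As a rough sketch of that pipeline, the snippet below shells out to Ghostscript to rasterize each page and then to an OCR engine. Tesseract is my assumption here (no engine is named above), as are the helper names, the 300 dpi resolution, and the `eng` default language; both `gs` and `tesseract` must be installed and on the PATH (on Windows the Ghostscript executable has a different name).

```python
import subprocess
from pathlib import Path


def ghostscript_command(pdf_path, out_dir, dpi=300):
    """Command that renders every page of pdf_path to a PNG in out_dir."""
    return [
        "gs", "-dSAFER", "-dBATCH", "-dNOPAUSE",
        "-sDEVICE=png16m", f"-r{dpi}",
        f"-sOutputFile={Path(out_dir) / 'page-%03d.png'}",
        str(pdf_path),
    ]


def tesseract_command(image_path, lang="eng"):
    """Command that OCRs one image and writes the text to stdout."""
    return ["tesseract", str(image_path), "stdout", "-l", lang]


def pdf_to_text(pdf_path, out_dir, lang="eng"):
    """Rasterize a PDF, OCR each page image, and return the combined text."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(ghostscript_command(pdf_path, out_dir), check=True)
    pages = []
    for image in sorted(Path(out_dir).glob("page-*.png")):
        result = subprocess.run(
            tesseract_command(image, lang),
            check=True, capture_output=True, text=True,
        )
        pages.append(result.stdout)
    return "\n".join(pages)
```

For a Spanish-language document, passing `lang="spa"` should give better results, provided that Tesseract language pack is installed.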