How to read the contents of PDF files whose encoding is `none`?

217 Views Asked by Yulia V At 30 June 2023 at 16:14

Update: solved, see the comments below.

When I try to read the contents of some PDF files, I get an empty string. I have noticed that this happens for PDF files whose encoding is reported as none, while it works fine for PDF files whose encoding is reported as base64. The other suspect is file size: perhaps PyGithub fails to read large files. Obviously, without reading the file I cannot apply OCR.

This happens when I read entire directories on GitHub and copy them to another cloud storage. I am not concerned with any one PDF file in particular.

The alternative to PyGithub is calling the REST API through the requests package; I will try that later.

The PDF file I used is this one, and it's the same with other PDF files that use languages with special characters.

from github import Github

# token and repo_name are assumed to be defined elsewhere
github_object = Github(token)
github_user = github_object.get_user()
repo = github_user.get_repo(repo_name)
cont_raw = repo.get_contents("20200910-BETA8-ROTULACION-INTERIOR-BOCETO-final.pdf")
# content comes back empty even though size reports about 1.28 MB
print(cont_raw.size, len(cont_raw.content), cont_raw.encoding)
# output: 1283429 0 none
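
The reported size (1283429 bytes) points to the file-size suspicion: GitHub's contents endpoint does not inline bodies of files larger than 1 MB, and in that case reports an empty content with encoding none. A minimal sketch of a fallback, assuming PyGithub's Repository.get_git_blob and ContentFile.sha behave as documented; the helper name read_file_bytes is hypothetical:

```python
import base64


def read_file_bytes(repo, content_file):
    """Return the raw bytes of a file fetched with repo.get_contents().

    repo and content_file are assumed to behave like PyGithub's
    Repository and ContentFile objects; only the attributes used
    below are required.
    """
    # Small files: the contents endpoint inlines the body as base64.
    if content_file.encoding == "base64" and content_file.content:
        return base64.b64decode(content_file.content)

    # Larger files: content is empty and encoding is "none", so fetch
    # the same object through the Git blob endpoint by its SHA.
    blob = repo.get_git_blob(content_file.sha)
    if blob.encoding == "base64":
        return base64.b64decode(blob.content)
    # The blob endpoint can also report "utf-8" for plain-text blobs.
    return blob.content.encode("utf-8")
```

With the objects from the snippet above, read_file_bytes(repo, cont_raw) should return bytes that start with %PDF if the fallback worked. The blob endpoint has its own size limit (100 MB at the time of writing), so this is not a fix for arbitrarily large files.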


There are 2 best solutions below

johnwhitington On 30 June 2023 at 18:24

This PDF file does not contain any text or fonts. What looks like text is just ordinary PDF filled shapes.

So, you have no choice but to rasterize and OCR.

In this particular example, it has nothing to do with the language or "encoding" in use.

Joris Schellekens On 30 June 2023 at 22:37

To understand your problem, you need to understand PDF in a bit more detail.

You see, PDF is not a WYSIWYG (what you see is what you get) format. If you were to look at the source of an .html document, you'd recognize the text of the page, and you'd be able to derive other information such as:

These characters belong together in a paragraph
These paragraphs make up a column in a row, in a table
etc

PDF is more like a programming language. Inside a PDF you'll find a special kind of data structure (called a stream) that represents the contents of a page.

Each of these content streams is essentially a piece of text, usually compressed, made up of drawing instructions in a PostScript-like syntax (PostScript being a programming language).

In pseudo-code, you might find things like:

go to position 40, 450
set the stroke color to black
set the font to Helvetica, size 12
render the character with ID 12
go to position 45, 450
etc

Now imagine that, instead of using a font, your instructions were something like:

go to position 40, 450
set the stroke color to black
stroke the following path: .... (which happens to render 'H')
etc

Because there is no font and no character ID (abbreviated as CID), there is no way of knowing what the underlying text is. The only thing any reader or parser software would see is "this page contains some vector graphics".
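
The difference between the two pseudo-code listings can be sketched with real content-stream operators. This is a naive illustration, not a PDF parser: the two byte strings are hypothetical, already-decompressed streams, and splitting on whitespace would be fooled by string literals that contain spaces.

```python
# Operators that show text in a PDF content stream.
TEXT_SHOWING_OPERATORS = {b"Tj", b"TJ", b"'", b'"'}

# Hypothetical content streams, written by hand for illustration.
# WITH_FONT selects a font (Tf), positions the cursor (Td), shows "H" (Tj).
WITH_FONT = b"BT /F1 12 Tf 40 450 Td (H) Tj ET"
# PATHS_ONLY strokes three line segments that merely look like an "H".
PATHS_ONLY = b"0 0 0 RG 40 450 m 40 462 l S 48 450 m 48 462 l S 40 456 m 48 456 l S"


def shows_text(content_stream):
    """Naive check: does the stream contain any text-showing operator?"""
    return any(token in TEXT_SHOWING_OPERATORS for token in content_stream.split())


print(shows_text(WITH_FONT))   # True
print(shows_text(PATHS_ONLY))  # False
```

A text extractor has nothing to work with in the second stream: it contains only move (m), line (l) and stroke (S) operators.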

In other words, you have shapes that look like text, not text.

The best way forward would be to convert every page of your PDF to an image (perhaps using a tool such as Ghostscript) and then apply OCR to the resulting images.
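
A rough sketch of that pipeline, assuming the Ghostscript executable is available as gs (on Windows it is typically gswin64c) and that the Tesseract command-line tool is installed with Spanish language data. Tesseract is an assumption here, the answer does not name an OCR engine, and the function names, output file pattern, resolution and language code are illustrative choices.

```python
import subprocess
from pathlib import Path


def ghostscript_command(pdf_path, out_dir, dpi=300):
    """Build a Ghostscript command that renders each page to a PNG file."""
    out_pattern = str(Path(out_dir) / "page-%03d.png")
    return [
        "gs", "-dNOPAUSE", "-dBATCH", "-dSAFER",
        "-sDEVICE=png16m", f"-r{dpi}",
        f"-sOutputFile={out_pattern}",
        str(pdf_path),
    ]


def tesseract_command(image_path, lang="spa"):
    """Build a Tesseract command that prints the recognized text to stdout."""
    return ["tesseract", str(image_path), "stdout", "-l", lang]


def ocr_pdf(pdf_path, out_dir, lang="spa"):
    """Rasterize every page, OCR each image, and return the joined text."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(ghostscript_command(pdf_path, out_dir), check=True)
    pages = []
    for image in sorted(Path(out_dir).glob("page-*.png")):
        result = subprocess.run(
            tesseract_command(image, lang),
            check=True, capture_output=True, text=True,
        )
        pages.append(result.stdout)
    return "\n".join(pages)
```

Calling ocr_pdf("input.pdf", "pages") would write page-001.png, page-002.png and so on into the pages directory and return the recognized text. Higher resolutions generally help OCR on small lettering at the cost of larger images.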
