Update: solved, see the comments below.
When I try to read the contents of some PDF files, I get an empty string. I have noticed that this happens with PDF files whose encoding is reported as none, while it works fine for PDF files identified as base64. The other suspect is file size: perhaps PyGithub fails to read big files. Obviously, without reading the file I cannot apply OCR.
This happens when I read entire directories on GitHub and copy them to another cloud storage, so the problem is not tied to any one PDF file in particular.
An alternative to PyGithub is calling the REST API through the requests package; I will try that later.
The PDF file I used is this one, and the same happens with other PDF files that use languages with special characters.
from github import Github

# token and repo_name are defined elsewhere
github_object = Github(token)
github_user = github_object.get_user()
repo = github_user.get_repo(repo_name)
cont_raw = repo.get_contents("20200910-BETA8-ROTULACION-INTERIOR-BOCETO-final.pdf")
# Reported size in bytes, length of the returned content string, reported encoding
print(cont_raw.size, len(cont_raw.content), cont_raw.encoding)
# output: 1283429 0 none
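A possible workaround, sketched under assumptions rather than verified against this repository: the reported size (1283429 bytes) is above 1 MB, and the GitHub contents endpoint is documented to return an empty content string with encoding none for files between 1 MB and 100 MB. The Git blobs endpoint handles files up to 100 MB, and PyGithub exposes it as get_git_blob. The helper name read_file_bytes is hypothetical; repo is expected to be a PyGithub Repository object like the one above.

```python
import base64


def read_file_bytes(repo, path):
    """Return the raw bytes of `path` from a PyGithub Repository object.

    Hypothetical helper: tries the contents endpoint first and falls back
    to the Git blobs endpoint when no content comes back.
    """
    item = repo.get_contents(path)
    if item.encoding == "base64" and item.content:
        # Small file: the contents endpoint returned base64 text directly.
        return base64.b64decode(item.content)

    # Assumption: encoding "none" with empty content means the file is over
    # the 1 MB limit of the contents endpoint, so fetch the blob by its SHA.
    blob = repo.get_git_blob(item.sha)
    if blob.encoding == "base64":
        return base64.b64decode(blob.content)
    # The blobs endpoint may also return plain UTF-8 text.
    return blob.content.encode("utf-8")
```

With the repo object from the snippet above, read_file_bytes(repo, "20200910-BETA8-ROTULACION-INTERIOR-BOCETO-final.pdf") should then return the PDF bytes, which can be written to the other storage or passed on for OCR.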
This PDF file does not contain any text or fonts. What looks like text is just ordinary filled PDF shapes.
So you have no choice but to rasterize the pages and run OCR on them.
In this particular example, it has nothing to do with the language or "encoding" in use.