What encoding system/mechanism is used by Python's binary read?


I need to read a file in binary format, which I do like this:

with open("tex.pdf", mode='br') as file:
    fileContent = file.read()
for i in fileContent:
    print(i, end=" ")

This prints decimal integers, which I assumed were ASCII codes. However, ASCII covers only 0..127, whereas the output contains integers greater than 127, such as 225, 180, and 193.

Can someone tell me what encoding/mechanism is used?


There are 2 best solutions below

Corralien On BEST ANSWER

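Binary mode applies no encoding at all: read() returns the raw bytes of the file, and iterating over a bytes object yields each byte as an integer from 0 to 255, which is why you see values above 127. If the bytes represent text, you have to decode them yourself with the appropriate encoding.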

Documentation:

Files opened in binary mode (including 'b' in the mode argument) return contents as bytes objects without any decoding.

Example:

# file encoding: utf-16
with open('data.txt', 'rb') as fp:
    buf = fp.read()
    print(buf)

# Output
b'\xff\xfeH\x00e\x00l\x00l\x00o\x00 \x00w\x00o\x00r\x00l\x00d\x00\n\x00'
>>> buf.decode('utf-8')
...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

>>> buf.decode('utf-16')
'Hello world\n'

The text contains only ASCII characters, but the file is encoded as UTF-16, so the raw bytes have to be decoded manually with that encoding.
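To connect this back to the integers printed in the question, here is a minimal sketch. The bytes literal is a made-up sample (the UTF-16 little-endian bytes of "Hi" with a byte order mark), not data from any real file; bytes.hex() with a separator argument assumes Python 3.8 or later.

```python
# Hypothetical sample: the UTF-16 (little-endian, with BOM) bytes of "Hi".
data = b"\xff\xfeH\x00i\x00"

# Iterating over a bytes object yields plain integers in range(256),
# with no encoding involved.
values = list(data)
print(values)  # [255, 254, 72, 0, 105, 0]

# The same bytes shown as hexadecimal, which is often easier to read.
print(data.hex(" "))  # ff fe 48 00 69 00

# Interpreting the bytes as text is a separate, explicit step.
text = data.decode("utf-16")
print(text)  # Hi
```

The integers are simply the byte values; they only become characters once an encoding is chosen in decode().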

Mark Adler On

PDF is a binary format, so all byte values are possible and expected. The PDF format is rather complex. If you want to try to extract the text that is in a PDF, there are many libraries out there. Here is a review of some of them.
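As a rough illustration of why values above 127 show up, the sketch below checks for the "%PDF-" marker that PDF files start with and counts the bytes outside the ASCII range. The helper names looks_like_pdf and count_non_ascii and the sample bytes are invented for this example; with a real file you would pass the result of file.read() instead.

```python
def looks_like_pdf(data: bytes) -> bool:
    """Return True if the data starts with the PDF header marker."""
    return data.startswith(b"%PDF-")


def count_non_ascii(data: bytes) -> int:
    """Count bytes whose value lies outside the ASCII range 0..127."""
    return sum(1 for b in data if b > 127)


# Hypothetical stand-in for the first bytes of a PDF file.
sample = b"%PDF-1.7\n%\xe2\xe3\xcf\xd3\n"

print(looks_like_pdf(sample))   # True
print(count_non_ascii(sample))  # 4
```

This only inspects raw bytes; it does not parse the PDF structure or extract any text, which is what a dedicated PDF library is for.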