What encoding system/mechanism is used by Python's binary read?

70 Views Asked by Dileesha abilash At 13 February 2023 at 07:05

I need to read a file in binary format, which I do like this:

with open("tex.pdf", mode='br') as file:
    fileContent = file.read()
for i in fileContent:
    print(i, end=" ")

This prints decimal integers, which I assumed were ASCII codes. However, ASCII covers only 0..127, whereas the output contains integers greater than 127, such as 225, 180, and 193.

Can someone tell me what encoding/mechanism is used?



Corralien On 13 February 2023 at 07:10 BEST ANSWER

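Reading in binary mode applies no encoding at all: you get the raw bytes, and iterating over a bytes object yields integers in the range 0..255. If those bytes represent text, you have to decode them yourself with the appropriate encoding.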

Documentation:

Files opened in binary mode (appending 'b' to the mode argument) return contents as bytes objects without any decoding

Example:

# file encoding: utf-16
with open('data.txt', 'rb') as fp:
    buf = fp.read()
    print(buf)

# Output
b'\xff\xfeH\x00e\x00l\x00l\x00o\x00 \x00w\x00o\x00r\x00l\x00d\x00\n\x00'

>>> buf.decode('utf-8')
...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

>>> buf.decode('utf-16')
'Hello world\n'

The text uses only ASCII characters, but the file is encoded as UTF-16, so I have to decode the raw bytes manually.
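
To tie this back to the integers printed in the question, here is a minimal sketch showing that iterating over a bytes object gives plain integers in 0..255, and that text only appears after an explicit decode. The sample bytes and variable names are assumptions chosen for illustration, not data from the asker's file.

```python
# Hypothetical sample data: the UTF-16 bytes (BOM + little-endian) for "Hi".
buf = b"\xff\xfeH\x00i\x00"

# Iterating over a bytes object yields integers in the range 0..255.
values = list(buf)
print(values)  # [255, 254, 72, 0, 105, 0]

# Values above 127 are not ASCII codes; they are simply raw byte values.
non_ascii = [v for v in values if v > 127]
print(non_ascii)  # [255, 254]

# Text only appears once the bytes are decoded with the right encoding.
text = buf.decode("utf-16")
print(text)  # Hi
```

So the numbers seen in the question are byte values, not characters in any particular encoding.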

Mark Adler On 13 February 2023 at 20:46

PDF is a binary format, so all byte values are possible and expected. The PDF format is rather complex. If you want to try to extract the text that is in a PDF, there are many libraries out there. Here is a review of some of them.
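
As a rough illustration of what "binary format" means here, the sketch below reads a file in binary mode, checks for the "%PDF-" marker that PDF files begin with, and counts the bytes above 127. The file contents are a made-up stand-in (an assumption for the example), not a valid PDF, and the variable names are hypothetical.

```python
import os
import tempfile

# Hypothetical stand-in for a real PDF: a header line plus arbitrary binary bytes.
sample = b"%PDF-1.7\n" + bytes([225, 108, 180, 193])

# Write the sample to a temporary file so the example is self-contained.
with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

try:
    # Binary mode returns the bytes exactly as stored, with no decoding.
    with open(path, "rb") as fp:
        data = fp.read()
finally:
    os.remove(path)

looks_like_pdf = data.startswith(b"%PDF-")
high_bytes = sum(1 for b in data if b > 127)
print(looks_like_pdf, high_bytes)  # True 3
```

Bytes above 127 in such a file are expected; they belong to binary data, not to a text encoding of the whole file.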
