When reading a UTF-8 text file in Python you may encounter an illegal UTF-8 byte. You will then probably try to find the line (number) containing the illegal character, but this will likely fail. This is illustrated by the code below.
Step 1: Create a file containing an illegal UTF-8 byte (a1 hex = 161 decimal)
filename = r"D:\wrong_utf8.txt"
longstring = "test just_a_text"*10
with open(filename, "wb") as f:
    for lineno in range(1, 100):
        if lineno == 85:
            f.write(f"{longstring}\terrrocharacter->".encode('utf-8') + bytes.fromhex('a1') + "\r\n".encode('utf-8'))
        else:
            f.write(f"{longstring}\t{lineno}\r\n".encode('utf-8'))
Step 2: Read the file and catch the error:
print("First pass, regular Python textline read.")
with open(filename, "r",encoding='utf8') as f:
lineno=0
while True:
try:
lineno+=1
line=f.readline()
if not line:
break
print(lineno)
except UnicodeDecodeError:
print (f"UnicodeDecodeError at line {lineno}\n")
break
It prints: UnicodeDecodeError at line 50
I would expect the reported error line to be line 85. However, line 50 is printed! So the customer who sent the file to us was unable to find the illegal character. I tried additional parameters to the open statement (including buffering) but was unable to get the right error line number.
Note: if you shorten longstring enough, the problem goes away. So the problem probably has to do with Python's internal buffering.
I succeeded by using the following code to find the error line:
print("Second pass, Python byteline read.")
with open(filename,'rb') as f:
lineno=0
while True:
try:
lineno+=1
line = f.readline()
if not line:
break
lineutf8=line.decode('utf8')
print(lineno)
except UnicodeDecodeError: #Exception as e:
mybytelist=line.split(b'\t')
for index,field in enumerate(mybytelist):
try:
fieldutf8=field.decode('utf8')
except UnicodeDecodeError:
print(f'UnicodeDecodeError in line {lineno}, field {index+1}, offending field: {field}!')
break
break
Now it prints the right lineno: UnicodeDecodeError in line 85, field 2, offending field: b'errrocharacter->\xa1\r\n'!
Is this the Pythonic way of finding the error line? It works all right, but I somehow have the feeling that a better method should be available, one that does not require reading the file twice and/or using a binary read.
The actual cause is indeed the way Python internally processes text files. They are read in chunks, each chunk is decoded according to the specified encoding, and then, if you use readline or iterate over the file object, the decoded buffer is split into lines which are returned one at a time. You can see evidence of that by examining the UnicodeDecodeError object at the time of the error.
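A minimal sketch of that inspection (the object, start, end and reason attributes are standard on UnicodeDecodeError; filename is the file created in Step 1):

print("Inspecting the UnicodeDecodeError.")
with open(filename, "r", encoding='utf8') as f:
    try:
        for _ in f:
            pass
    except UnicodeDecodeError as e:
        # e.object is the byte buffer Python was decoding when the error occurred,
        # e.start is the offset of the offending byte inside that buffer.
        print(f"buffer length: {len(e.object)}, error at offset {e.start}")
        print(f"offending bytes: {e.object[e.start:e.end]}, reason: {e.reason}")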
With your example data, you can find that Python was trying to decode a buffer of 8149 bytes, and that the offending character occurs at position 5836 in that buffer.
This processing happens deep inside the Python io library, because text files have to be buffered and the binary buffer is decoded before being split into lines. So IMHO little can be done here, and the best way is probably your second attempt: read the file as a binary file and decode the lines one at a time.
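If you want to avoid the two passes, a single-pass sketch of that idea (same filename as above, line numbering starting at 1) could look like this:

print("Single pass, binary read with per-line decode.")
with open(filename, 'rb') as f:
    for lineno, raw in enumerate(f, start=1):
        try:
            raw.decode('utf8')
        except UnicodeDecodeError as e:
            # e.start is the byte offset of the bad byte within this line
            print(f"UnicodeDecodeError in line {lineno} at byte offset {e.start}: {raw!r}")
            break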
Alternatively, you could use errors='replace' to replace any offending byte with a REPLACEMENT CHARACTER (U+FFFD). But then, you would no longer test for an error, but search for that character in each line.
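A minimal sketch of that approach (same filename as above; '\ufffd' is the character that errors='replace' substitutes for undecodable bytes):

print("Third pass, text read with errors='replace'.")
with open(filename, "r", encoding='utf8', errors='replace') as f:
    for lineno, line in enumerate(f, start=1):
        if '\ufffd' in line:
            print(f"Replacement character found in line {lineno}: {line!r}")
            break

With the example file this also reports line 85, as expected.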