When reading a UTF-8 text file in Python you may encounter an illegal UTF-8 byte. You will then probably try to find the line (number) containing the illegal character, but this will likely fail. This is illustrated by the code below.
Step 1: Create a file containing an illegal UTF-8 byte (a1 hex = 161 decimal)
filename = r"D:\wrong_utf8.txt"
longstring = "test just_a_text"*10
with open(filename, "wb") as f:
    for lineno in range(1, 100):
        if lineno == 85:
            f.write(f"{longstring}\terrrocharacter->".encode('utf-8') + bytes.fromhex('a1') + "\r\n".encode('utf-8'))
        else:
            f.write(f"{longstring}\t{lineno}\r\n".encode('utf-8'))
Step 2: Read the file and catch the error:
print("First pass, regular Python textline read.")
with open(filename, "r",encoding='utf8') as f:
lineno=0
while True:
try:
lineno+=1
line=f.readline()
if not line:
break
print(lineno)
except UnicodeDecodeError:
print (f"UnicodeDecodeError at line {lineno}\n")
break
It prints: UnicodeDecodeError at line 50
I would expect the reported error line to be line 85. However, line 50 is printed! So the customer who sent the file to us was unable to find the illegal character. I tried additional parameters to the open statement (including buffering) but was unable to get the right error line number.
Note: if you shorten longstring enough, the problem goes away. So the problem probably has to do with Python's internal buffering.
I succeeded by using the following code to find the error line:
print("Second pass, Python byteline read.")
with open(filename,'rb') as f:
lineno=0
while True:
try:
lineno+=1
line = f.readline()
if not line:
break
lineutf8=line.decode('utf8')
print(lineno)
except UnicodeDecodeError: #Exception as e:
mybytelist=line.split(b'\t')
for index,field in enumerate(mybytelist):
try:
fieldutf8=field.decode('utf8')
except UnicodeDecodeError:
print(f'UnicodeDecodeError in line {lineno}, field {index+1}, offending field: {field}!')
break
break
Now it prints the right lineno: UnicodeDecodeError in line 85, field 2, offending field: b'errrocharacter->\xa1\r\n'!
Is this the Pythonic way of finding the error line? It works all right, but I somehow have the feeling that a better method should be available, one that does not require reading the file twice and/or using a binary read.
The actual cause is indeed the way Python internally processes text files. They are read in chunks, each chunk is decoded according to the specified encoding, and then, if you use readline or iterate over the file object, the decoded buffer is split into lines which are returned one at a time. You can see evidence of that by examining the UnicodeDecodeError object at the time of the error.
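A minimal sketch of that inspection (the object, start, end and reason attributes are standard on UnicodeDecodeError; filename is the file created in Step 1):

print("Inspecting the UnicodeDecodeError.")
with open(filename, "r", encoding='utf8') as f:
    try:
        for _ in f:
            pass
    except UnicodeDecodeError as e:
        # e.object is the byte buffer Python was decoding when the error occurred,
        # e.start is the offset of the offending byte inside that buffer.
        print(f"buffer length: {len(e.object)}, error at offset {e.start}")
        print(f"offending bytes: {e.object[e.start:e.end]}, reason: {e.reason}")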
With your example data, you can find that Python was trying to decode a buffer of 8149 bytes, and that the offending character occurs at position 5836 in that buffer.
This processing happens deep inside the Python io library, because text files have to be buffered and the binary buffer is decoded before being split into lines. So IMHO little can be done here, and the best way is probably your second attempt: read the file as a binary file and decode the lines one at a time.
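If you want to avoid the two passes, a single-pass sketch of that idea (same filename as above, line numbering starting at 1) could look like this:

print("Single pass, binary read with per-line decode.")
with open(filename, 'rb') as f:
    for lineno, raw in enumerate(f, start=1):
        try:
            raw.decode('utf8')
        except UnicodeDecodeError as e:
            # e.start is the byte offset of the bad byte within this line
            print(f"UnicodeDecodeError in line {lineno} at byte offset {e.start}: {raw!r}")
            break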
Alternatively, you could use errors='replace' to replace any offending byte with a REPLACEMENT CHARACTER (U+FFFD). But then, you would no longer test for an error, but search for that character in each line.
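A minimal sketch of that approach (same filename as above; '\ufffd' is the character that errors='replace' substitutes for undecodable bytes):

print("Third pass, text read with errors='replace'.")
with open(filename, "r", encoding='utf8', errors='replace') as f:
    for lineno, line in enumerate(f, start=1):
        if '\ufffd' in line:
            print(f"Replacement character found in line {lineno}: {line!r}")
            break

With the example file this also reports line 85, as expected.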