How to detect a file has been truncated while reading

I'm reading lines from a group of log files, following them as they are written, using pyinotify.

I'm opening and reading the files with Python's built-in file methods:

file = open(self.file_path, 'r')
# ... later
line = file.readline()

This is generally stable and can handle the file being deleted and re-created: pyinotify reports the unlink and the subsequent link.

However, some log files are not deleted. Instead, they are truncated and new content is written from the beginning of the same file.

I'm having trouble reliably detecting when this has occurred, since pyinotify reports only a write. The only evidence I currently get is that pyinotify reports a write and readline() returns an empty string. But it is possible that two subsequent writes could trigger the same behavior.

I have thought of comparing the file's size to file.tell(), but according to the documentation tell() returns an opaque number, and it appears it can't be trusted to be a number of bytes.

Is there a simple way to detect a file has been truncated while reading from it?


Edit:

Truncating a file can be simulated with simple shell commands:

echo hello > test.log
echo hello >> test.log
# Truncate test.log
echo goodbye > test.log

To complement this, a simple Python script can be used to confirm that file.tell() does not decrease when the file is truncated:

foo = open('./test.log', 'r')
line = foo.readline()
while line != '':
    print(foo.tell())
    print(line)
    line = foo.readline()

# Put a breakpoint on the following line and 
# truncate the file before it executes
print(foo.tell())

There is 1 best solution below

Davis Herring

Use os.lseek(file.fileno(), 0, os.SEEK_CUR) to obtain a byte offset without moving the file pointer. Comparing that offset with the file's current size (for example, os.fstat(file.fileno()).st_size) then shows whether the file has shrunk below your read position. You can’t really use the regular file interface to find this out, not least because it may have buffered text (that no longer exists) that it hasn’t made visible to Python yet. If the file is not a byte stream (e.g., the default open in Python 3), it could even be in the middle of a multibyte character and be unable to proceed even if the file immediately grew back past your file offset.
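
A minimal sketch of how that check might look, assuming the byte offset from os.lseek is compared against the size reported by os.fstat. The comparison, the helper name was_truncated, and the demo function are assumptions made for illustration, not part of any library. The demo reproduces the shell commands from the question in a temporary directory.

```python
import os
import tempfile


def was_truncated(f):
    """Return True if the file is now shorter than the OS-level read offset.

    `was_truncated` is a hypothetical helper name, not a library function.
    """
    fd = f.fileno()
    # Seeking 0 bytes from the current position reports the byte offset
    # of the underlying descriptor without moving it.
    offset = os.lseek(fd, 0, os.SEEK_CUR)
    size = os.fstat(fd).st_size
    return size < offset


def demo():
    """Read a file to the end, truncate it from outside, and re-check."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "test.log")
        with open(path, "w") as writer:
            writer.write("hello\nhello\n")

        with open(path, "r") as reader:
            while reader.readline() != "":
                pass  # consume everything written so far
            before = was_truncated(reader)

            # Same effect as `echo goodbye > test.log`: truncate, then write.
            with open(path, "w") as writer:
                writer.write("goodbye\n")
            after = was_truncated(reader)

            # Rewind so reading restarts from the beginning of the new content.
            reader.seek(0)
            first_line = reader.readline()
    return before, after, first_line


if __name__ == "__main__":
    print(demo())
```

Two caveats apply to this sketch. If the file is truncated and then grows back past the old offset before the check runs, the size comparison will not notice. And because the file object is buffered, the descriptor's offset can be ahead of the lines readline() has actually returned, so the check is most meaningful once readline() has returned an empty string.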