Python Watchdog library update snapshot with chosen files

I am using Python 3.11 on Windows 11.

I have an app that watches a directory for changes. If there is a change, it uploads any new or modified files to Azure Blob Storage (after performing some preprocessing), and deletes from Blob Storage any files that were deleted locally.
This all runs on a schedule (e.g. every day at 12 am), and I do the directory watching with the Python library watchdog.

The class looks like this:

import pickle
from abc import ABC, abstractmethod

from watchdog.utils.dirsnapshot import DirectorySnapshot, DirectorySnapshotDiff

class Watcher(ABC):
    """ Interface for objects that report what changed since the last snapshot. """
    @abstractmethod
    def create_snapshot(self) -> None:
        pass

    @abstractmethod
    def update_snapshot(self) -> None:
        pass

    @abstractmethod
    def get_deleted_files(self) -> list[str]:
        pass

    @abstractmethod
    def get_created_files(self) -> list[str]:
        pass

    @abstractmethod
    def get_modified_files(self) -> list[str]:
        pass

class DirectoryWatcher(Watcher):
    def __init__(self, directory : str, snapshot_filepath : str):
        """
        A class for facilitating directory watching
        Args:
            directory (str): The directory to watch
            snapshot_filepath (str): The path to where the snapshot will be saved
        """
        self.directory : str = directory                    
        self.snapshot_filepath : str = snapshot_filepath
        self.snapshot_diff : DirectorySnapshotDiff | None = None  # cached diff, computed lazily
    
    def create_snapshot(self) -> None:
        """ Creates a snapshot and saves it in the snapshot_filepath. """
        snapshot = DirectorySnapshot(self.directory,recursive=True)
        with open(self.snapshot_filepath, "wb") as file:
            pickle.dump(snapshot,file)

    def update_snapshot(self) -> None:
        """ Updates the snapshot_filepath with the current directory. """
        self.create_snapshot()
        
    def get_snapshot_diff(self) -> DirectorySnapshotDiff:
        """ Returns the DirectorySnapshotDiff of the stored snapshot (in snapshot_filepath) and a snapshot of the watched directory. The result is cached. """
        if self.snapshot_diff is None:
            prev_snapshot : DirectorySnapshot = self.get_snapshot()
            curr_snapshot : DirectorySnapshot = DirectorySnapshot(self.directory,recursive=True)
            self.snapshot_diff = DirectorySnapshotDiff(prev_snapshot, curr_snapshot)
        return self.snapshot_diff
     
    def get_snapshot(self) -> DirectorySnapshot:
        """ Returns the DirectorySnapshot saved in snapshot_filepath """
        with open(self.snapshot_filepath, 'rb') as file:
            return pickle.load(file)

    def get_deleted_files(self) -> list[str]:
        """ Returns the list of files that were deleted. """
        return self.get_snapshot_diff().files_deleted

    def get_created_files(self) -> list[str]:
        """ Returns the list of files that were created. """
        return self.get_snapshot_diff().files_created

    def get_modified_files(self) -> list[str]:
        """ Returns the list of files that were modified. """
        return self.get_snapshot_diff().files_modified

I use the snapshot to compute what changed in the directory since the last run, so that I know which files to create, update, or delete in Blob Storage. As of now, the snapshot is updated only after all the files have been uploaded to Blob Storage.
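The flow described above can be sketched roughly as follows. This is a simplified illustration, not the real app: `run_sync`, `upload` and `delete` are placeholder names standing in for the preprocessing and Azure Blob Storage calls, and `SupportsDiff` is a hypothetical protocol that just mirrors the methods of `DirectoryWatcher`.

```python
from typing import Callable, Protocol


class SupportsDiff(Protocol):
    """Hypothetical protocol mirroring the DirectoryWatcher methods used here."""
    def get_created_files(self) -> list[str]: ...
    def get_modified_files(self) -> list[str]: ...
    def get_deleted_files(self) -> list[str]: ...
    def update_snapshot(self) -> None: ...


def run_sync(watcher: SupportsDiff,
             upload: Callable[[str], None],
             delete: Callable[[str], None]) -> None:
    """One scheduled run: push local changes, then refresh the snapshot."""
    # Upload everything that is new or changed since the stored snapshot.
    for path in [*watcher.get_created_files(), *watcher.get_modified_files()]:
        upload(path)
    # Remove blobs whose local files are gone.
    for path in watcher.get_deleted_files():
        delete(path)
    # The snapshot is rewritten only here, after every transfer succeeded.
    # If any call above raises, this line is never reached.
    watcher.update_snapshot()
```

Because `update_snapshot()` is the last step, any exception or interruption during the loops leaves the old snapshot on disk.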

The problem is that this update guarantees neither atomicity nor consistency: if my app is interrupted mid-upload, part of the directory will already have been uploaded, but the snapshot will not have been updated. As a result, the files that were already uploaded get uploaded again the next time the program runs.

I was wondering whether watchdog has a utility for updating the snapshot file by file, i.e. rather than waiting for all the documents to be uploaded before updating the snapshot, I would update the snapshot entry for each file as soon as it has been uploaded. Or am I getting my hopes up?
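To make the idea concrete, here is a minimal sketch of the kind of per-file bookkeeping I have in mind. It does not use any watchdog API, since I don't know of one for this. `UploadJournal`, its JSON file, and the methods `is_done`, `record` and `clear` are all hypothetical names, and the sketch assumes that a path together with its modification time and size is enough to recognise a file that was already uploaded.

```python
import json
import os
import tempfile


class UploadJournal:
    """Hypothetical helper: remembers which files were already uploaded."""

    def __init__(self, journal_path: str):
        self.journal_path = journal_path
        self.entries: dict[str, list[int]] = {}
        # Reload entries left behind by an interrupted run, if any.
        if os.path.exists(journal_path):
            with open(journal_path, "r", encoding="utf-8") as file:
                self.entries = json.load(file)

    @staticmethod
    def _fingerprint(path: str) -> list[int]:
        # Assumption: mtime plus size identifies a file version well enough.
        stat = os.stat(path)
        return [stat.st_mtime_ns, stat.st_size]

    def is_done(self, path: str) -> bool:
        """True if this exact version of the file was recorded as uploaded."""
        recorded = self.entries.get(path)
        if recorded is None:
            return False
        try:
            return recorded == self._fingerprint(path)
        except OSError:
            return False  # the file vanished in the meantime

    def record(self, path: str) -> None:
        """Mark one file as uploaded and persist the journal immediately."""
        self.entries[path] = self._fingerprint(path)
        self._flush()

    def clear(self) -> None:
        """Forget everything; meant to be called after the snapshot is rewritten."""
        self.entries = {}
        if os.path.exists(self.journal_path):
            os.remove(self.journal_path)

    def _flush(self) -> None:
        # Write to a temporary file, then swap it in with os.replace so that
        # a crash never leaves a half-written journal behind.
        directory = os.path.dirname(os.path.abspath(self.journal_path))
        fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
        with os.fdopen(fd, "w", encoding="utf-8") as file:
            json.dump(self.entries, file)
        os.replace(tmp_path, self.journal_path)
```

The upload loop would then skip any path for which `journal.is_done(path)` is true, call `journal.record(path)` after each successful upload, and call `journal.clear()` once `update_snapshot()` has finished. The pickled `DirectorySnapshot` itself stays untouched until the whole run succeeds.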

I have tried to implement my own code for file watching but I could never match the performance of watchdog.
