I use fsspec, which relies on the built-in capabilities of paramiko, but I could not find a way to paginate a directory listing.
Is there a way to get that functionality here?
The use case: every directory holds about 100000 files, and loading the whole listing into memory at once is a bad idea, I suppose.
There is `sftp.listdir_iter`, but is that capability available in fsspec?

`listdir_iter` would provide a more direct way to achieve pagination, since it returns an iterator that lets you retrieve items one by one.

But you could also consider `listdir_attr`, which loads all items at once so that you can slice the list to get the desired page; that may be faster. That means you can implement pagination by slicing the returned list of `SFTPAttributes` objects.
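
A minimal sketch of that slicing approach, assuming `sftp` is an already-connected `paramiko.SFTPClient`. The function name `listdir_attr_page` and the 1-based `page` convention are illustrative choices, not part of paramiko:

```python
def listdir_attr_page(sftp, path, page, page_size):
    """Return one 1-based page of entries from ``sftp.listdir_attr(path)``.

    ``sftp`` is assumed to be a connected ``paramiko.SFTPClient``.
    """
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be >= 1")
    # The whole listing is fetched and held in memory before slicing.
    # Sorting by filename keeps pages stable between calls.
    entries = sorted(sftp.listdir_attr(path), key=lambda attr: attr.filename)
    start = (page - 1) * page_size
    return entries[start:start + page_size]


# Usage (hypothetical path, assumes an open connection):
# for attr in listdir_attr_page(sftp, "/data", page=2, page_size=100):
#     print(attr.filename, attr.st_size)
```

The sort is there because paramiko documents the listing order as arbitrary; without it, two calls are not guaranteed to cut the pages at the same places.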
This approach can be slightly more efficient than the one using `listdir_iter`, since it avoids iterating through the items one by one.

However, it still loads all the `SFTPAttributes` objects in memory before slicing the list. This memory overhead might not be an issue unless you have a very large number of files and limited memory resources.

To use `listdir_iter` with fsspec, you can create a custom `PaginatedSFTPFileSystem` class that inherits from `SFTPFileSystem`.

The custom class accesses the underlying paramiko SFTP client through the `self.ftp` attribute, and then uses the `listdir_iter` method directly.

By accessing the paramiko SFTP client in this way, you can use `listdir_iter` to implement pagination directly, even though it is not part of fsspec.

Using `sshfs` (an implementation of fsspec for the SFTP protocol using asyncssh), I do not see an `SSHFS.listdir`-like method.

But `sshfs` also has a lot of other basic filesystem operations, such as `mkdir`, `touch` and `find`.

You might therefore build a custom implementation for pagination on the `find` method, which is inherited from the `AbstractFileSystem` class in fsspec.
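
As a rough sketch of the `find`-based idea, assuming `fs` is any fsspec filesystem object (for instance an `sshfs.SSHFileSystem`). The helper name `find_page` is made up for this example:

```python
def find_page(fs, path, page, page_size, **find_kwargs):
    """Return one 1-based page of file paths found under ``path``.

    ``fs`` is assumed to be an fsspec filesystem. Extra keyword arguments
    (for example ``maxdepth=1`` to stay in one directory) go to ``fs.find``.
    """
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be >= 1")
    # detail=False returns plain path strings; the full list is built first.
    paths = fs.find(path, detail=False, **find_kwargs)
    start = (page - 1) * page_size
    return paths[start:start + page_size]


# Usage (hypothetical host and credentials):
# from sshfs import SSHFileSystem
# fs = SSHFileSystem("example.com", username="user", password="...")
# print(find_page(fs, "/data", page=1, page_size=100, maxdepth=1))
```

Note that `find` is recursive by default, so without `maxdepth` the pages cover files from all subdirectories.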
This implementation uses the `find` method with the `detail` parameter set to `False` to get a list of file paths. Then, it implements pagination by slicing the list of items.
Again, this approach loads all the items into memory before slicing the list, which may be inefficient for very large directories.
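
To keep only one page in memory, the `listdir_iter` idea described earlier could look roughly like this. It is written as a mixin so that the block runs without a server; the mixin name and the `listdir_paginated` method name are assumptions for illustration, and the only thing it relies on is the `self.ftp` paramiko client that fsspec's `SFTPFileSystem` exposes:

```python
from itertools import islice


class ListdirIterPaginationMixin:
    """Page-wise listing for a class whose ``self.ftp`` is a paramiko SFTP client."""

    def listdir_paginated(self, path, page, page_size):
        """Return the entries of the given 1-based page as a list."""
        if page < 1 or page_size < 1:
            raise ValueError("page and page_size must be >= 1")
        start = (page - 1) * page_size
        entries = self.ftp.listdir_iter(path)  # lazy iterator of SFTPAttributes
        # islice skips the first `start` entries without storing them, so only
        # one page is held in memory (earlier entries are still transferred).
        return list(islice(entries, start, start + page_size))


# With fsspec and paramiko installed, the mixin would be combined like this
# (hypothetical host and credentials):
#
# from fsspec.implementations.sftp import SFTPFileSystem
#
# class PaginatedSFTPFileSystem(ListdirIterPaginationMixin, SFTPFileSystem):
#     pass
#
# fs = PaginatedSFTPFileSystem("example.com", username="user", password="...")
# for attr in fs.listdir_paginated("/data", page=2, page_size=1000):
#     print(attr.filename)
```

SFTP has no way to seek into a directory listing, so reaching page N still means reading past the earlier pages; the gain is in memory, not in network traffic.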
I suppose you can pass an existing `SFTPFileSystem` object to your custom `PaginatedSFTPFileSystem` class and use its underlying SFTP connection.

To do this, you can modify the custom class to accept an `SFTPFileSystem` object during initialization and use its `ftp` attribute (the paramiko SFTP client) for listing the directory items.

You can then create an `SFTPFileSystem` object and pass it to the `PaginatedSFTPFileSystem`. The custom class will use the SFTP connection from the existing `SFTPFileSystem` object, eliminating the need to provide the `host`, `username`, and `password` again.

Corralien suggests in the comments to use `walk(path, maxdepth=None, topdown=True, **kwargs)`.

You can use this method with your custom `PaginatedSFTPFileSystem` class, as it inherits from `SFTPFileSystem`, which in turn inherits from `AbstractFileSystem`. This means that the `walk` method is available to your custom class.

However, that might not be the most suitable choice for pagination, as it returns files and directories in a nested structure, making it harder to paginate the results in a straightforward manner.
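
Separately from `walk`, the wrapper variant mentioned above (the one that takes an existing `SFTPFileSystem` rather than inheriting from it) might be sketched as follows. The `iter_pages` method is a hypothetical name; the sketch only assumes that the object passed in has an `ftp` attribute holding the paramiko SFTP client:

```python
from itertools import islice


class PaginatedSFTPFileSystem:
    """Wraps an existing fsspec ``SFTPFileSystem`` and reuses its connection."""

    def __init__(self, sftp_fs):
        # `ftp` is the paramiko SFTP client that fsspec opened for this filesystem.
        self.sftp = sftp_fs.ftp

    def iter_pages(self, path, page_size):
        """Yield successive lists of at most ``page_size`` directory entries."""
        if page_size < 1:
            raise ValueError("page_size must be >= 1")
        entries = iter(self.sftp.listdir_iter(path))
        while True:
            page = list(islice(entries, page_size))
            if not page:
                return
            yield page


# Usage (hypothetical host and credentials):
# from fsspec.implementations.sftp import SFTPFileSystem
# fs = SFTPFileSystem("example.com", username="user", password="...")
# for page in PaginatedSFTPFileSystem(fs).iter_pages("/data", page_size=1000):
#     print(len(page), page[0].filename)
```

Walking the pages in a single pass like this avoids re-reading earlier entries for each page, at the cost of only supporting forward iteration. Since this variant wraps the connection instead of subclassing, it does not expose `walk` on its own.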
If you need pagination for only the top-level directories, you can modify the custom `PaginatedSFTPFileSystem` class to include a custom implementation of the `walk` method with pagination support for the top level.
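
One possible shape for such top-level pagination, written here as a standalone helper rather than a method override so it can be tried with any fsspec filesystem. The name `walk_top_level_page` and the "directories first, then files" ordering are assumptions of this sketch:

```python
def walk_top_level_page(fs, path, page, page_size, **walk_kwargs):
    """Return ``(root, dirs, files)`` for one 1-based page of the top level."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be >= 1")
    # maxdepth=1 restricts walk to `path` itself; the default covers the case
    # where walk yields nothing (for example, an unreadable path).
    root, dirs, files = next(
        fs.walk(path, maxdepth=1, **walk_kwargs), (path, [], [])
    )
    # Order directories first, then files, and slice the combined sequence.
    entries = [(name, True) for name in sorted(dirs)]
    entries += [(name, False) for name in sorted(files)]
    start = (page - 1) * page_size
    window = entries[start:start + page_size]
    page_dirs = [name for name, is_dir in window if is_dir]
    page_files = [name for name, is_dir in window if not is_dir]
    return root, page_dirs, page_files


# Usage (assumes `fs` is an fsspec filesystem such as SFTPFileSystem):
# root, dirs, files = walk_top_level_page(fs, "/data", page=1, page_size=100)
```

The top-level listing is still fetched in full by `walk` before it is sliced.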
Again, that would only paginate the top-level directories and files, not those within the subdirectories.
If you need pagination for files and directories at all levels, consider using the `find` method or the custom `listdir_paginated` method, as described above.

As noted by mdurant in the comments:
See Instance/Listing caching.
Depending on your use case, you might need to pass `skip_instance_cache=True` or `use_listings_cache=False`.

Consider that, if you use the same arguments to create a `PaginatedSFTPFileSystem` instance, fsspec will return the cached `SFTPFileSystem` instance.
If you want to force the creation of a new SFTP session, you can do so by passing a unique argument when creating the `PaginatedSFTPFileSystem` instance, for example a dummy argument that takes a different value each time you want a new session.

Two instances `fs1` and `fs2` created that way will have separate SFTP sessions, despite having the same host, username, and password, because the unique dummy arguments force fsspec to create new instances instead of reusing the cached one.
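
The caching behaviour can be checked without a server. The sketch below uses fsspec's `LocalFileSystem` purely as a stand-in, on the assumption that instance caching works the same way for every cachable filesystem class, and it uses `skip_instance_cache=True` rather than a dummy argument:

```python
from fsspec.implementations.local import LocalFileSystem

# Same arguments: fsspec hands back the cached instance.
fs_a = LocalFileSystem()
fs_b = LocalFileSystem()
same_instance = fs_a is fs_b

# skip_instance_cache=True bypasses the cache and builds a fresh instance.
fs_c = LocalFileSystem(skip_instance_cache=True)
fresh_instance = fs_c is not fs_a

print(same_instance, fresh_instance)

# The SFTP equivalent would be (hypothetical host and credentials):
# from fsspec.implementations.sftp import SFTPFileSystem
# fs1 = SFTPFileSystem("example.com", username="user", password="...",
#                      skip_instance_cache=True)
# fs2 = SFTPFileSystem("example.com", username="user", password="...",
#                      skip_instance_cache=True)
```

As far as I can tell, `SFTPFileSystem` forwards its extra keyword arguments to paramiko's `connect`, so a dummy argument would have to be consumed by the custom class before it reaches paramiko; `skip_instance_cache=True` sidesteps that.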