Why async folder traversal is slower than sequential


I am benchmarking a folder traversal using aiofiles against the standard os library. Here is my code. This is the async version of my function:

import os
import time
from typing import List
import aiofiles.os as aio_os
import asyncio

subfolders1 = set()
async def traverrse_local_path() -> List[str]:  

    async def traverse(local_path, count=None):
        
        try:
            current_dirs_list = [local_path + '/' + f for f in os.listdir(local_path)]
            current_subfolders = [f for f in current_dirs_list if os.path.isdir(f)]

            if not current_subfolders:
                if any([await aio_os.path.isfile(f) for f in current_dirs_list]):
                    subfolders1.add(local_path)
                    return
            for p in current_subfolders:
                await traverse(p, count)
        except WindowsError:
            pass



    start = time.perf_counter()
    # tasks = [traverse(f.path) for f in current_path_folders]
    path_to_proj = 'C:/Program Files'
    list_folders = [path_to_proj + '/' + f for f in os.listdir(path_to_proj) if os.path.isdir(path_to_proj + '/' + f)]
    await asyncio.gather(*(traverse(f, count) for count, f in enumerate(list_folders)))
    end = time.perf_counter()
    print(end-start)

loop = asyncio.get_event_loop()
try:
    loop.run_until_complete(traverrse_local_path())
finally:
    loop.close()
print(len(subfolders1))

And this is the normal, sequential version of my function:

subfolders_2 = set()
def traverrse_local_path() -> List[str]:
    
    def traverse(local_path, count=None):
        try:
            current_dirs_list = [local_path + '/' + f for f in os.listdir(local_path)]
            current_subfolders = [f for f in current_dirs_list if os.path.isdir(f)]
            if not current_subfolders:
                if any([os.path.isfile(f) for f in current_dirs_list]):
                    subfolders_2.add(local_path)
                    return
            for p in current_subfolders:
                traverse(p, count)
        except WindowsError:
            pass

    start = time.perf_counter()
    path_to_proj = 'C:/Program Files'
    traverse(path_to_proj)
    end = time.perf_counter()
    print(end - start)

traverrse_local_path()
print(len(subfolders_2))

Here is the time comparison (screenshot omitted): the async version takes noticeably longer.

Why is that? I thought the async version should be faster.

There are 2 answers below.

Giacomo Catenazzi (best answer)

Because async has much more overhead, and in your case there is no advantage.

os.listdir() gives you the entire content of a directory in one go. Then you traverse each subdirectory in it, with two methods. With async you are assuming that reading some directories may be much slower than reading other directories inside the same parent (i.e. there is wait time during which extra work could get done). In fact, that is a wrong assumption in most cases: within the same parent, you should be able to access every subdirectory at roughly the same speed. So now you have just the overhead.

There may be some cases where your assumption is true: different mount points inside the traversal, but usually there are not many mount points. If you have a fast and a slow hard disk, you may see some difference. Network disks are another case, but it is usually the same network and the same file server, so all requests should take about the same time (async may help if you mix local and network disks).

So: async is good when you have many tasks that spend time waiting, and the waiting is not caused by access to the same shared resource. With disks you may get the first condition, but the cache is efficient and disks are fast, so the waits are short (and usable mostly by the OS for other processes), which gives little advantage. Directory traversal very often fails the second condition as well. So you just have a lot of overhead.
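The per-call overhead is easy to measure in isolation. Here is a minimal sketch (the function names and loop counts are my own, not from the question) comparing a plain function call with an equivalent coroutine awaited inside an event loop:

```python
import asyncio
import time

def plain_noop():
    return None

async def async_noop():
    return None

def bench_sync(n: int) -> float:
    # n plain function calls
    start = time.perf_counter()
    for _ in range(n):
        plain_noop()
    return time.perf_counter() - start

async def _run_async(n: int) -> float:
    # n awaits of an equivalent coroutine inside the event loop
    start = time.perf_counter()
    for _ in range(n):
        await async_noop()
    return time.perf_counter() - start

def bench_async(n: int) -> float:
    # asyncio.run creates and tears down a fresh event loop
    return asyncio.run(_run_async(n))

if __name__ == "__main__":
    n = 100_000
    print(f"sync : {bench_sync(n):.4f}s")
    print(f"async: {bench_async(n):.4f}s")
```

On a typical machine the async loop takes noticeably longer per iteration: that difference is pure coroutine/event-loop overhead, which a traversal only earns back if there is genuine waiting to overlap.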

jsbueno

There is little margin for gain in this code: any gain would come mostly from the OS having previously cached certain filesystem subtrees, and would hardly be reproducible. Only if, after finding each file in a given subtree, the independent asyncio task performed some real I/O (such as recording information about the file in a database, or making an API request with it) could asyncio yield concurrency gains for this kind of task.

As it is, just adding file paths to a set, there is no margin for any asyncio gain: it is a trivial CPU task, and all your asyncio code does is add the overhead of the asyncio loop (small) and of aiofiles (which uses multi-threading under the hood, limited to one CPU core at a time by Python's GIL; and even without the GIL, the filesystem access would likely be serialized back in the OS kernel).
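To make the "threads under the hood" point concrete: an async wrapper around a blocking os call is roughly a dispatch of that call to a worker thread. This sketch uses `asyncio.to_thread` rather than aiofiles itself, but illustrates the same mechanism:

```python
import asyncio
import os
import tempfile

async def is_file_async(path: str) -> bool:
    # Run the blocking os.path.isfile on a pool thread; this is
    # roughly what aiofiles.os does behind the scenes.
    return await asyncio.to_thread(os.path.isfile, path)

async def check(paths):
    # gather dispatches the checks "concurrently", but each one still
    # executes a blocking call on a thread, serialized by the GIL
    return await asyncio.gather(*(is_file_async(p) for p in paths))

if __name__ == "__main__":
    with tempfile.NamedTemporaryFile() as tmp:
        print(asyncio.run(check([tmp.name, os.getcwd()])))
```

The event loop is free while the threads run, but since the surrounding code has nothing else to do, that freedom buys nothing here.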

There might be a small gain from using plain multi-threading with this code instead of asyncio, via a thread pool from concurrent.futures.ThreadPoolExecutor - but maybe, even then, there is no gain at all.
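A thread-pool version of the same leaf-directory search could look like the sketch below (the function name and `os.scandir`-based structure are my own, not the question's code, and whether it beats the sequential version is exactly the open question):

```python
import os
from concurrent.futures import ThreadPoolExecutor

def leaf_dirs_with_files(root: str, max_workers: int = 8) -> set:
    """Collect leaf directories (no subdirectories) that contain
    at least one file, walking each top-level subtree on a pool thread."""
    found = set()

    def traverse(path: str) -> None:
        try:
            with os.scandir(path) as it:
                entries = list(it)
        except OSError:
            return
        subdirs = [e.path for e in entries if e.is_dir(follow_symlinks=False)]
        if not subdirs:
            if any(e.is_file() for e in entries):
                found.add(path)  # set.add is atomic under the GIL
            return
        for p in subdirs:
            traverse(p)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        tops = [e.path for e in os.scandir(root)
                if e.is_dir(follow_symlinks=False)]
        list(pool.map(traverse, tops))
    return found
```

Note that `os.scandir` already avoids one `stat` call per entry compared with `os.listdir` plus `os.path.isdir`, which often matters more than any concurrency.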

TL;DR: a case of the wrong tool for the job. Just go with sync code anyway.