Why does os.scandir() slow down/ how to reorganize huge directory?


I have a directory with 3 million+ files in it (which I should have avoided creating in the first place). Using os.scandir() to simply print out the names,

import os

for f in os.scandir():
    print(f)

takes 0.004 seconds per item for the first ~200,000 files, but then drastically slows down to 0.3 seconds per item. When I tried it again, it behaved the same way: fast for the first ~200,000, then much slower.

After waiting an hour and running it again, this time it was fast for the first ~400,000 files but then slowed down in the same way.
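For anyone who wants to reproduce this measurement, a minimal timing sketch might look like the following (the directory path and reporting interval are assumptions, not from the original post):

```python
import os
import time

def time_scandir(path=".", report_every=100_000):
    """Iterate a directory with os.scandir() and print the average
    per-entry time every `report_every` entries. Returns the total count."""
    last = time.monotonic()
    count = 0
    for entry in os.scandir(path):
        count += 1
        if count % report_every == 0:
            now = time.monotonic()
            print(f"{count:>9} entries  {(now - last) / report_every:.6f} s/entry")
            last = now
    return count
```

Running this against the big directory would show whether the per-entry cost really jumps at a particular point, or whether the slowdown creeps in gradually.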

The files all start with a year between 1908 and 1963, so I've tried reorganizing the files using bash commands like

for i in {1908..1963}; do
    mkdir ../test-folders/$i
    mv $i* ../test-folders/$i/
done

But it hangs and never seems to make any progress...
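The same reorganization can be sketched in Python without any shell globbing, which sidesteps the repeated directory scans entirely (a sketch, not tested on millions of files; the source and destination paths are assumptions):

```python
import os
import shutil

def bucket_by_year(src=".", dest="../test-folders"):
    """Move files whose names start with a 4-digit year into per-year
    subdirectories under `dest`. Returns the number of files moved."""
    # Snapshot the listing once, so we are not mutating the directory
    # while iterating over it.
    names = os.listdir(src)
    moved = 0
    for name in names:
        if name[:4].isdigit():
            bucket = os.path.join(dest, name[:4])
            os.makedirs(bucket, exist_ok=True)
            shutil.move(os.path.join(src, name), os.path.join(bucket, name))
            moved += 1
    return moved
```

Each file is moved individually, so there is no glob expansion and no giant argument list, at the cost of one syscall per file.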

Any advice on how to reorganize this huge folder or more efficiently list the files in the directory?

There are 2 best solutions below

Answer by Daniel Butler

It sounds like an iterator, a function that yields one item at a time instead of loading everything into memory, would be best here.

The glob library provides the iglob function, which returns exactly such an iterator:

import glob, os

for infile in glob.iglob(os.path.join(rootdir, '*.*')):
    print(infile)  # or process each file here

Documentation: https://docs.python.org/3/library/glob.html#glob.iglob

Related question and answer: https://stackoverflow.com/a/17020892/7838574

Answer by Wes Hardaker

Oof. That's a lot of files. I'm not sure why Python starts slowing down; that is interesting. But there are several reasons you're having problems. First, a directory can be thought of as a special type of file that just holds the names and data pointers of all the files in it (grossly simplified). Access to it, like access to any file, can be faster when the OS has cached some of that information in memory to speed up disk access across the system as a whole.

It seems strange that Python gets slower; maybe you're hitting an internal memory limit or some other mechanism in Python.

But let's fix the root of the problem. Your bash script is problematic because every time you use a * character, you force bash to read the entire directory (and likely sort it alphabetically) again. It would be wiser to get the list once and then operate on sections of it. Maybe something like:

/bin/ls -1 > /tmp/allfiles          # read the directory just once
for i in {1908..1963}; do
    echo "moving files starting with $i"
    mkdir -p ../test-folders/$i
    # piping through xargs avoids "argument list too long" when a year
    # has many files (-r skips years with no matches; GNU xargs/mv)
    grep "^$i" /tmp/allfiles | xargs -d '\n' -r mv -t ../test-folders/$i/
done

This will read the directory only once (sort of) and will keep you informed about how fast it's going.
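The same "list once, then operate" idea can also be expressed in Python, reading the saved listing and grouping names by their year prefix before moving anything (a sketch; the listing file and directory paths are assumptions):

```python
import os
import shutil
from collections import defaultdict

def move_from_listing(listing_file, src=".", dest="../test-folders"):
    """Read a saved one-name-per-line listing, group names by their
    4-digit year prefix, and move each group into its own subdirectory.
    Returns the number of files moved."""
    groups = defaultdict(list)
    with open(listing_file) as f:
        for line in f:
            name = line.strip()
            if name[:4].isdigit():
                groups[name[:4]].append(name)
    moved = 0
    for year, names in groups.items():
        bucket = os.path.join(dest, year)
        os.makedirs(bucket, exist_ok=True)
        for name in names:
            shutil.move(os.path.join(src, name), os.path.join(bucket, name))
            moved += 1
    return moved
```

Because the grouping happens in memory, the big directory is never re-scanned per year, and no command line ever has to hold thousands of filenames.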