As part of my data processing pipeline I'm reading many HDF5 files from a network drive, potentially far from the physical machine. After profiling (using cProfile) my code, which does essentially the following:
```python
data = []
for path in paths:
    with h5py.File(path, 'r') as hdf:
        data.append(hdf['dataset'][()])
return data
```
I found that there are two main calls in this loop: `h5py.File.__init__` (which dispatches to `make_fid` internally) and `File.__getitem__` (which dispatches to the `read` method of `h5py._selector.Blahblah`). Now, `make_fid` takes almost as much time as `__getitem__` itself when reading from a far-away drive, and drops to almost negligible when reading files that were moved to a local SSD, while the `__getitem__` runtime remains almost constant in terms of time per call. I'm no OS expert, so I'd like to ask what exactly contributes to this slowdown: is it plain network transfer, some filesystem operations/synchronization, or something else entirely? Network would be the most likely culprit, but I have two issues with that explanation:
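To double-check the cProfile attribution, one thing I could do is time the open and read phases separately with wall-clock timers. Below is a rough stdlib-only sketch; the `open_fn`/`read_fn` callables are placeholders I introduce here, which for my actual case would be something like `lambda p: h5py.File(p, 'r')` and `lambda f: f['dataset'][()]` (that mapping is my assumption, not tested in this snippet):

```python
import time

def timed_phases(paths, open_fn, read_fn):
    """Accumulate wall-clock time spent opening vs. reading each file.

    open_fn(path) must return a handle with a .close() method;
    read_fn(handle) returns the data. For the h5py case (an assumption,
    not exercised here) these would wrap h5py.File and dataset access.
    """
    open_total = read_total = 0.0
    data = []
    for path in paths:
        t0 = time.perf_counter()
        handle = open_fn(path)          # phase 1: file open / fid creation
        t1 = time.perf_counter()
        try:
            data.append(read_fn(handle))  # phase 2: actual data read
        finally:
            handle.close()
        t2 = time.perf_counter()
        open_total += t1 - t0
        read_total += t2 - t1
    return data, open_total, read_total
```

Comparing `open_total` against `read_total` on the network drive versus the SSD would show whether the open phase really dominates remotely.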
- Shouldn't it contribute to `__getitem__`, which executes `read`, rather than to the instantiation of the `File` object?
- Using the method from here to benchmark network transfer from my VM to two different non-local drives, I found that they differed by almost 3x in read throughput, but this translates to barely a ~20% speedup of one over the other when executing the code above.