Reading data with h5py is slow due to `make_fid`

As part of my data processing pipeline I'm reading many HDF5 files from a network drive, potentially far from the physical machine. I profiled the code with cProfile; it does basically the following:

import h5py

def load_all(paths):
    data = []
    for path in paths:
        # Open each file read-only and pull the whole dataset into memory.
        with h5py.File(path, 'r') as hdf:
            data.append(hdf['dataset'][()])
    return data
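
The profiling run itself looks roughly like the sketch below; the paths are placeholders rather than my actual files, and load_all is the function above:

import cProfile
import pstats

# Placeholder paths; the real files live on a network share.
paths = ['/mnt/network_share/file_0001.h5', '/mnt/network_share/file_0002.h5']

profiler = cProfile.Profile()
profiler.enable()
load_all(paths)
profiler.disable()

# Sort by cumulative time so the expensive calls appear at the top.
pstats.Stats(profiler).sort_stats('cumulative').print_stats(20)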

I found that two calls dominate this loop: h5py.File.__init__ (which dispatches to make_fid internally) and File.__getitem__ (which ultimately dispatches to the 'read' method of an internal class in h5py._selector). When reading from the remote drive, make_fid takes almost as much time as __getitem__ itself, but it becomes almost negligible for files moved to a local SSD, while the per-call runtime of __getitem__ stays almost constant. I am not an OS expert, so I would like to ask what exactly contributes to this slowdown: is it plain network transfer, some filesystem operations/synchronization, or something else entirely? The network would be the most likely culprit, but I have two issues with that explanation:

  1. Shouldn't it contribute to __getitem__, which performs the actual read, rather than to the instantiation of the File object?
  2. Using the method from here to benchmark network transfer from my VM to two different non-local drives, I found that they differed by almost 3x in read throughput, but this translates into barely a ~20% speedup of one over the other when executing the code above (see the open-vs-read timing sketch below).
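
To help pin down where the time goes, here is a minimal timing sketch that measures the file open and the dataset read separately; the paths are placeholders for the same file stored once on the network share and once on the local SSD:

import time

import h5py

# Placeholder paths: the same file on the network share and on the local SSD.
paths = ['/mnt/network_share/file_0001.h5', '/mnt/local_ssd/file_0001.h5']

for path in paths:
    t0 = time.perf_counter()
    hdf = h5py.File(path, 'r')    # file open: this is where make_fid runs
    t1 = time.perf_counter()
    data = hdf['dataset'][()]     # dataset read: this is where __getitem__ spends its time
    t2 = time.perf_counter()
    hdf.close()
    print(f'{path}: open {t1 - t0:.3f} s, read {t2 - t1:.3f} s')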