h5py: loading large data to HDF5 using chunked storage


I have 3072 matrices of size 1024x1024, so my dataset looks like 1024x1024x3072. This data amounts to 24 GB, which makes it impossible to load into memory, so I'm looking to use HDF5's chunked storage so I can work on chunks that fit in memory (128x128x3072). The problem is that my code seems extremely inefficient: it takes more than 12 hours to create an HDF5 file from just a segment of my data (1024x1024x300). Here's the code I've written so far:

import h5py
import numpy as np
from tqdm import tqdm

with h5py.File("FFT_Heights.h5", "w") as f:
    dset = f.create_dataset("chunked", (1024, 1024, 300),
                            chunks=(128, 128, 300), dtype='complex128')
    for ii in tqdm(range(300)):
        dset[ii] = np.load(f'K field {ii}.npy').astype('complex128')

As you can see in the example code, I'm only taking 300 of the 3072 matrices; this is to make sure the code works on a smaller dataset before running it on the whole thing. Also, keep in mind that my data is complex, and the imaginary part must not be lost while creating the file, so I'm setting the dtype beforehand. So, bottom line, the problem is the writing speed. The generated HDF5 file is constructed properly (I've checked), but I need to run this code for all 3072 images, and I would like to know if there's a way to make the file creation more efficient (I've also tried different chunk sizes but got the same writing speed). Lastly, I'm working in Python. Thanks in advance!


There are 2 answers below.

Answer by kcw78

You need to modify the chunk size. It's the wrong size and shape.

  1. First, it's too large. The recommended chunk size range is between 10 KiB and 1 MiB (larger for larger datasets). By my calculation, yours is about 75 MiB.
  2. More importantly, it's the wrong shape for loading (and reading) your image arrays. Each image is 1024x1024, and you load one at a time. With a chunk shape of (128, 128, 300), every image you write touches 64 chunks (8x8).

I modified chunks to chunks=(1024, 1024, 1). This matches the shape of one image, so you write or read exactly one chunk each time you access an image. It also reduces the chunk size to 16 MiB.
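For reference, the size of one chunk in bytes is just the product of the chunk shape and the dtype's itemsize. A quick sketch comparing the two shapes:

import numpy as np

# One chunk's size in bytes = product of the chunk shape * itemsize of the dtype
itemsize = np.dtype('complex128').itemsize   # 16 bytes

original  = 128 * 128 * 300 * itemsize       # chunk shape from the question
per_image = 1024 * 1024 * 1 * itemsize       # one chunk per image

print(f'(128, 128, 300): {original / 2**20:.1f} MiB')   # 75.0 MiB
print(f'(1024, 1024, 1): {per_image / 2**20:.1f} MiB')  # 16.0 MiB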

I ran a test loading 400 complex128 .npy files. It runs in 33 seconds on a (very) old Windows workstation with 24 GB RAM. Note: load time is not linear. The first 250 files load quickly (25 .npy files in 0.7 sec); files 250-400 take longer (25 .npy files in 3.6 sec).

Note: I had to modify the dataset indexing on the line that loads the .npy files into the dataset. I'm not sure how or why your indexing worked; maybe broadcasting took care of it. See the code below:

import time
import h5py
import numpy as np

cnt = 400
with h5py.File("FFT_Heights.h5", "w") as h5f:
    # One chunk per image: the chunk shape matches a (1024, 1024) image slice
    dset = h5f.create_dataset("chunked", (1024, 1024, cnt),
                              chunks=(1024, 1024, 1), dtype='complex128')
    total = time.time()
    for ii in range(cnt):
        # Each image goes to its own chunk along the last axis
        dset[:, :, ii] = np.load(f'K field {ii}.npy')

print(f'Total elapsed time = {time.time()-total:.2f}')
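Reading a single image back from this layout then touches exactly one chunk. A minimal read sketch, assuming the file created above:

import h5py

# Each dset[:, :, ii] slice maps to exactly one (1024, 1024, 1) chunk on disk
with h5py.File("FFT_Heights.h5") as h5f:
    img = h5f["chunked"][:, :, 100]    # reads one 16 MiB chunk into memory
    print(img.shape, img.dtype)        # (1024, 1024) complex128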
Second answer by kcw78

This is an updated procedure based on your comments about needing to read the HDF5 data along the last axis. As mentioned in my comments, this creates a dilemma when setting the chunk shape: the optimal chunk shape for loading the data as image slices is different from the optimal shape for slicing through the images, so the "optimal" chunk shape has to balance the two processes. In other words, if you modify the chunk shape to improve read performance, you degrade load performance (and vice versa). However, there is a way to work around the I/O bottleneck with a small modification to the load procedure.

Your procedure calls np.load() to read each .npy file and writes it directly into the HDF5 dataset. Instead, create a NumPy array (sized to hold as many images as the 3rd chunk axis), read enough .npy files to fill the array, and then write the whole array to the HDF5 dataset. Then repeat. :-) I ran some tests, and this process loads AND reads the data faster than the original process with either chunk shape. Timing data for 300 .npy files:

  1. Timing data for chunks=(1024, 1024, 1) [my first answer]

    • Elapsed time to load data files = 18.59 sec
    • Elapsed time to read dataset slices = 934.50 sec
  2. Timing data for chunks=(256, 256, 15)

    • Elapsed time to load data files = 1533.21 sec
    • Elapsed time to read dataset slices = 64.29 sec
  3. Timing data for chunks=(256, 256, 15) with modified load process

    • Elapsed time to load data files = 20.73 sec
    • Elapsed time to read dataset slices = 32.06 sec

Code below.
(Note: it reads 300 .npy files in sets of 15, so the file count is an exact multiple of the increment. It will take some tweaking for the general case -- or pick an increment that is a factor of the number of images, say 16 for 3072; see the sketch at the end of this answer.)

import time
import h5py
import numpy as np

a0, a1 = 1024, 1024         # image shape
c0, c1, c2 = 256, 256, 15   # chunk shape; c2 images are buffered per write
cnt = 300

with h5py.File("FFT_Heights_2.h5", "w") as h5f:
    dset = h5f.create_dataset("chunked", (a0, a1, cnt),
                              chunks=(c0, c1, c2), dtype='complex128')
    total = time.time()
    incr = time.time()
    arr_k = 0
    # Buffer that accumulates c2 images before each write to the dataset
    arr = np.empty(shape=(a0, a1, c2), dtype='complex128')
    for ii in range(cnt):
        arr[:, :, arr_k] = np.load(f'K field {ii}.npy')
        arr_k += 1
        if arr_k == c2:
            # Buffer is full: write one batch of c2 images in a single call
            dset[:, :, ii+1-c2:ii+1] = arr
            arr_k = 0
            arr = np.empty(shape=(a0, a1, c2), dtype='complex128')
        if (ii+1) % 25 == 0:
            print(f'finished array {ii}; elapsed time={time.time()-incr:.2f}')
            incr = time.time()

print(f'Elapsed time to load data files = {time.time()-total:.2f}')

# Read back 256x256 slices spanning the full image stack -- the access
# pattern the (256, 256, 15) chunk shape is meant to support
total = time.time()
with h5py.File("FFT_Heights_2.h5") as h5f:
    dset = h5f["chunked"]
    for i in range(4):
        print(f'i={i}; ', end='')
        for j in range(4):
            print(f' j={j}... ', end='')
            cplx_arr = dset[256*i:256*(i+1), 256*j:256*(j+1), :]
        print('')
print(f'Elapsed time to read dataset slices = {time.time()-total:.2f}')
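
For the full 3072-image dataset, a generalized version of the load loop might look like this. This is a sketch only (not tested against your files): it assumes the same 'K field {ii}.npy' naming, uses an increment of 16 (which divides 3072 evenly, giving 16 MiB chunks), and flushes a partial final batch in case the file count is not a multiple of the chunk depth.

import time
import h5py
import numpy as np

a0, a1 = 1024, 1024         # image shape
c0, c1, c2 = 256, 256, 16   # 16 divides 3072 evenly -> 16 MiB chunks
cnt = 3072

with h5py.File("FFT_Heights_full.h5", "w") as h5f:
    dset = h5f.create_dataset("chunked", (a0, a1, cnt),
                              chunks=(c0, c1, c2), dtype='complex128')
    total = time.time()
    arr = np.empty(shape=(a0, a1, c2), dtype='complex128')
    arr_k = 0        # number of images currently buffered
    start = 0        # dataset index where the current batch begins
    for ii in range(cnt):
        arr[:, :, arr_k] = np.load(f'K field {ii}.npy')
        arr_k += 1
        if arr_k == c2:                 # buffer full: write one batch
            dset[:, :, start:start+c2] = arr
            start += c2
            arr_k = 0
    if arr_k:                           # flush a partial final batch, if any
        dset[:, :, start:start+arr_k] = arr[:, :, :arr_k]

print(f'Elapsed time to load data files = {time.time()-total:.2f}')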