I have 3072 matrices of size 1024x1024, so my dataset has shape 1024x1024x3072. This data amounts to 24 GB, which makes it impossible to load into memory all at once, so I'm looking to use HDF5's chunked storage to operate on chunks that do fit in memory (128x128x3072). The problem is that my code seems to be extremely inefficient: it takes more than 12 hours to create an HDF5 file from just a segment of my data (1024x1024x300). Here's the code I've written so far:
import h5py
import numpy as np
from tqdm import tqdm

with h5py.File("FFT_Heights.h5", "w") as f:
    dset = f.create_dataset(
        "chunked", (1024, 1024, 300), chunks=(128, 128, 300), dtype='complex128'
    )
    for ii in tqdm(range(300)):
        dset[ii] = np.load(f'K field {ii}.npy').astype('complex128')
As you can see in my example code, I'm only taking 300 out of the 3072 matrices; this is because I'm trying to make sure the code works on a smaller dataset before running it on the whole data. Also, keep in mind that my data is complex, and the imaginary part must not be compromised while creating the file, which is why I'm setting the dtype beforehand. So, bottom line, the problem is the writing speed. The generated HDF5 file is constructed properly (I've checked), but I need to run this code for 3072 images, and I would like to know if there's a way to make the file creation more efficient (I've also tried different chunk sizes but got the same writing speed). Lastly, I'm working in Python. Thanks in advance!
You need to modify the chunk size. It's the wrong size and shape.
I modified the chunks to chunks=(1024, 1024, 1). This matches the shape of one image, so you write or read exactly one chunk each time you access an image. It also reduces the chunk size to 16 MiB. I ran a test loading 400 complex128 npy files; it ran in 33 seconds on a (very) old Windows workstation with 24 GB RAM. Note that load time is not linear: the first 250 files load quickly (25 npy files in 0.7 sec), while files 250-400 take longer (25 npy files in 3.6 sec).
Note: I had to modify the dataset indexing on the line that loads the npy files into the dataset. I'm not sure how/why your indexing worked; maybe broadcasting worked for you. See code below:
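Roughly, the modified version looks like this (a sketch: the 'K field {ii}.npy' filename pattern is taken from the question, and the 400-file count matches the test described above; adjust N and the dataset's third dimension to 3072 for the full run):

import h5py
import numpy as np
from tqdm import tqdm

N = 400  # number of images in this test run; 3072 for the full dataset

with h5py.File("FFT_Heights.h5", "w") as f:
    # One chunk per image: each write touches exactly one 1024x1024x1 chunk
    # (16 MiB of complex128), instead of spanning many large chunks.
    dset = f.create_dataset(
        "chunked", (1024, 1024, N), chunks=(1024, 1024, 1), dtype='complex128'
    )
    for ii in tqdm(range(N)):
        # Index the third axis, which is where the images are stacked.
        dset[:, :, ii] = np.load(f'K field {ii}.npy').astype('complex128')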