HDF5 files and plotting using chunks


I'm new to HDF5 files and I don't understand how to access chunks in a dataset. I have quite a big dataset of shape (1536, 2048, 11, 18, 2), chunked into (768, 1024, 1, 1, 1); each chunk represents half of an image. I want to plot the mean value of each (whole) image using matplotlib.

Question: how do I access separate chunks and how do I work with them? (How does h5py use them?)

This is my code:

import h5py
import numpy as np

bla = np.random.randint(0, 100, (1536, 2048, 11, 18, 2))

with h5py.File('test.h5', 'w') as f:
    grp = f.create_group('Measurement 1')
    grp.create_dataset('data', data=bla, chunks=(768, 1024, 1, 1, 1))

# no f.close() needed; the with block closes the file automatically

I have this to get access to the dataset, but I don't know how to access the chunks:

with h5py.File('test.h5', 'r') as hf:
    for dset in hf['Measurement 1'].keys():
        print(dset)
    ds_hf = hf['Measurement 1']['data']  # returns an HDF5 dataset object
    print(ds_hf)
    print(ds_hf.shape, ds_hf.dtype)
    data_f = hf['Measurement 1']['data'][:]  # adding [:] returns a numpy array

I need the program to open each chunk, compute its mean value, and close it again before opening the next one, so my RAM doesn't fill up.


There are 2 answers below.

Omid Roshani (BEST ANSWER):

Here is some sample code showing how chunks work in HDF5. I wrote it in a general way; you can modify it based on your requirements:

import h5py
import numpy as np

# Generate random data
bla = np.random.randint(0, 100, (1536, 2048, 11, 18, 2))

# Create the HDF5 file and dataset
with h5py.File('test.h5', 'w') as f:
    grp = f.create_group('Measurement 1')
    grp.create_dataset('data', data=bla, chunks=(768, 1024, 1, 1, 1))

# Open the HDF5 file
with h5py.File('test.h5', 'r') as hf:
    # Access the dataset
    ds_hf = hf['Measurement 1']['data']
    print(ds_hf)
    print(ds_hf.shape, ds_hf.dtype)

    # Iterate over the chunks: iter_chunks() yields one tuple of slices
    # per stored chunk, so each read below loads exactly one chunk
    for chunk_slice in ds_hf.iter_chunks():
        chunk = ds_hf[chunk_slice]  # numpy array holding one chunk
        # Process the chunk
        chunk_mean = np.mean(chunk)
        print(f"Chunk {chunk_slice}: Mean value = {chunk_mean}")

# no hf.close() needed; the with block closes the file automatically
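To get from per-chunk means to the per-image means you want to plot, you can accumulate the chunk means per image. Here is a minimal sketch (not from the original answer), assuming the last three axes (11, 18, 2) index the images and each image is split into exactly two equal-sized chunks, so the image mean is the average of its two chunk means:

import h5py
import numpy as np
import matplotlib.pyplot as plt

with h5py.File('test.h5', 'r') as hf:
    ds = hf['Measurement 1']['data']
    img_shape = ds.shape[2:]        # (11, 18, 2) images (assumed layout)
    sums = np.zeros(img_shape)
    counts = np.zeros(img_shape)
    for sl in ds.iter_chunks():     # one tuple of slices per chunk
        img = (sl[2].start, sl[3].start, sl[4].start)  # image this chunk belongs to
        sums[img] += ds[sl].mean()  # reads exactly one chunk from disk
        counts[img] += 1

means = sums / counts               # per-image means, shape (11, 18, 2)

plt.plot(means.ravel(), '.')
plt.xlabel('image index (flattened)')
plt.ylabel('mean value')
plt.show()

Averaging the two chunk means is exact here because both half-image chunks have the same size.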
kcw78:

Chunks are used to optimize I/O performance. HDF5 (and h5py) writes and reads data in chunked blocks (one chunk at a time). This is handled in the background, and you do not have to worry about the chunking mechanism. The chunk size/shape is defined when you create the dataset and cannot be changed afterwards. If you need to change it, there are HDF5 utilities (such as h5repack) that can rewrite the file with a new chunk layout.

When reading data you don't have to worry about chunk size (in general); see the comments at the end for more details. Use NumPy slice notation to read the desired slice, and h5py/HDF5 will read it for you. **You do not have to write your code to read exactly 1 chunk at a time.**

Assuming axis 0 is the image index, the code below reads each image array into the image object (as a numpy array). It's much easier and cleaner than working with chunk objects.

with h5py.File('test.h5', 'r') as hf:
    ds_hf = hf['Measurement 1']['data']  # returns an HDF5 dataset object
    print(ds_hf.shape)
    for i in range(ds_hf.shape[0]):
        image = ds_hf[i]  # this returns a numpy array for image i

Although you don't have to worry about chunk size to read and write data, it's important to set an appropriate size for your use case. That discussion goes beyond your question; the size you chose is good for your application.
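As a minimal sketch (not from the original answer), assuming the last three axes (11, 18, 2) index the images, as the chunk shape in the question suggests, the same slice-notation approach computes the per-image means and plots them with matplotlib. Each ds[:, :, i, j, k] read pulls exactly the two half-image chunks for that image, so only one image is in RAM at a time:

import h5py
import numpy as np
import matplotlib.pyplot as plt

with h5py.File('test.h5', 'r') as hf:
    ds = hf['Measurement 1']['data']
    n1, n2, n3 = ds.shape[2:]  # (11, 18, 2) -> 396 images in total
    means = np.empty((n1, n2, n3))
    for i in range(n1):
        for j in range(n2):
            for k in range(n3):
                image = ds[:, :, i, j, k]      # one 1536 x 2048 image
                means[i, j, k] = image.mean()  # only this image is in RAM

plt.plot(means.ravel(), '.')
plt.title('Mean value per image')
plt.show()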