Why does parallel reading of an HDF5 Dataset max out at 100% CPU, but only for large Datasets?


I'm using Cython to read a single Dataset from an HDF5 file using 64 threads. Each thread calculates a start index start and a chunk size size, then reads that chunk into a common buffer buf, which is a memoryview of a NumPy array. Crucially, each thread opens its own handle to the file and the Dataset. Here's the code:

# prange needs this import; the HDF5 C API symbols (hid_t, hsize_t, H5Fopen,
# etc.) are assumed to be declared in an accompanying `cdef extern from
# "hdf5.h"` block or .pxd file
from cython.parallel import prange

def read_hdf5_dataset(const char* file_name, const char* dataset_name,
                      long[::1] buf, int num_threads):
    cdef hsize_t base_size = buf.shape[0] // num_threads
    cdef hsize_t start, size
    cdef hid_t file_id, dataset_id, type_id, mem_space_id, file_space_id
    cdef int thread
    for thread in prange(num_threads, nogil=True):
        start = base_size * thread
        # the last thread picks up the remainder
        size = base_size + buf.shape[0] % num_threads \
            if thread == num_threads - 1 else base_size
        file_id = H5Fopen(file_name, H5F_ACC_RDONLY, H5P_DEFAULT)
        # H5Dopen2 takes a dataset-access property list as its third
        # argument, not a file-access flag
        dataset_id = H5Dopen2(file_id, dataset_name, H5P_DEFAULT)
        type_id = H5Dget_type(dataset_id)
        mem_space_id = H5Screate_simple(1, &size, NULL)
        file_space_id = H5Dget_space(dataset_id)
        H5Sselect_hyperslab(file_space_id, H5S_SELECT_SET, &start,
                            NULL, &size, NULL)
        H5Dread(dataset_id, type_id, mem_space_id,
                file_space_id, H5P_DEFAULT, <void*> &buf[start])
        H5Tclose(type_id)  # H5Dget_type returns an id that must be closed
        H5Sclose(file_space_id)
        H5Sclose(mem_space_id)
        H5Dclose(dataset_id)
        H5Fclose(file_id)

Although it reads the Dataset correctly, total CPU utilization maxes out at exactly 100%, i.e. a single core's worth, on a float32 Dataset of ~10 billion entries, even though the same code uses all 64 CPUs (albeit at only ~20-30% utilization each, due to the I/O bottleneck) on a float32 Dataset of ~100 million entries. I've tried this on two different computing clusters with the same result. Could it have something to do with the size of the Dataset exceeding INT32_MAX?

What's stopping this code from running in parallel on extremely large datasets, and how can I fix it? Any other suggestions to improve the code's clarity or efficiency would also be appreciated.


Answer by tel (accepted):

Something is happening that is either preventing Cython's prange from launching multiple threads, or preventing the threads from making progress once launched. It may or may not be directly related to HDF5. Here are some possible causes:

  • Are you pre-allocating a buf large enough to hold the entire dataset before calling your function? If so, your program is allocating 40+ gigabytes of memory (4 bytes per float32 times ~10 billion entries). How much memory do the nodes you're running on have? Are you the only user? Memory starvation could easily cause the kind of performance issues you describe.

  • Both Cython and HDF5 require certain compilation flags in order to correctly support parallelism; in particular, prange only spawns threads if the extension is compiled and linked with OpenMP (see the setup.py sketch after this list). Between your small and large dataset runs, did you modify or recompile your code at all?

  • One easy way to explain why your program is using 100% of a single CPU is that it's getting hung up somewhere before read_hdf5_dataset is ever called. What other code in your program runs first, and could it be causing the problems you're seeing?
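
As a point of reference for the OpenMP flags, here is a minimal setup.py sketch, assuming GCC on Linux; the module and file names are illustrative:

# setup.py: minimal sketch, assuming GCC on Linux; names are illustrative
from setuptools import Extension, setup
from Cython.Build import cythonize

ext = Extension(
    "reader",
    sources=["reader.pyx"],
    libraries=["hdf5"],               # link against the HDF5 C library
    extra_compile_args=["-fopenmp"],  # required for prange to use threads
    extra_link_args=["-fopenmp"],
)
setup(ext_modules=cythonize(ext))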

Part of the problem here is that it's going to be very hard for anyone on this site to reproduce your exact issue, since we don't have most of your program, and I at least don't have any 40 GB HDF5 files lying around (back in my grad school days, though, terabytes). If none of my suggestions above helps, I think you have two ways forward:

  • Try to come up with a simplified repro of your issue, then edit your question to post it here.
  • Using a combination of a debugger and a profiler (and print statements, if you're feeling lame), try to track down the exact line your program is hanging on when single-CPU utilization spins up to 100%; see the faulthandler sketch after this list for one quick way to do that. That alone should tell you a whole lot more about what's going on. In particular, it should make it very clear whether anything is getting locked down by a mutex, as @Homer512 suggested in the comments.
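
A minimal sketch of that, using the standard library's faulthandler module and assuming a POSIX system: it dumps a Python-level traceback of every thread on demand (frames inside nogil C code won't appear, but it still narrows down where things stall).

# Dump every thread's Python traceback on SIGUSR1; while the program
# sits at 100% CPU, run `kill -USR1 <pid>` from another shell.
import faulthandler
import signal

faulthandler.register(signal.SIGUSR1, all_threads=True)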
Answer by Steve Rosam:

The issue you're experiencing might be due to Python's Global Interpreter Lock (GIL). Even though you're using prange with nogil=True, the HDF5 library you're using might not be releasing the GIL during its I/O operations. That would limit the parallelism of your code, causing it to run on a single core despite having multiple threads.

To address this, you could try using multiprocessing instead of multithreading. This involves creating separate processes, each with its own interpreter and memory space, which can truly run in parallel. However, it also adds overhead for inter-process communication, and it may not be suitable if the processes need to share a large amount of data.

Here's a simplified example of how you might use multiprocessing in Python:

from multiprocessing import Pool

import h5py  # using the high-level h5py API here for illustration

def read_chunk(args):
    file_name, dataset_name, start, size = args
    # Each process opens its own handle and returns its slice of the dataset
    with h5py.File(file_name, "r") as f:
        return f[dataset_name][start:start + size]

if __name__ == "__main__":
    # chunks is a list of (start, size) pairs covering the dataset
    with Pool(num_processes) as p:
        results = p.map(read_chunk, [(file_name, dataset_name, start, size)
                                     for start, size in chunks])

read_chunk is a function that reads one chunk of the HDF5 dataset. Pool.map applies this function to each element of its second argument, a list of per-chunk argument tuples; each tuple is passed as the single args argument of read_chunk.

You will probably need to handle the communication of data between processes differently, depending on the size and structure of your data; one option is sketched below.
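
For example, here is a minimal sketch using multiprocessing.shared_memory (Python 3.8+) so that workers write into one shared buffer in place rather than pickling each chunk back to the parent; the file name, dataset name, and sizes are all illustrative:

import numpy as np
import h5py
from multiprocessing import Pool, shared_memory

def read_chunk_into_shared(args):
    file_name, dataset_name, shm_name, start, size = args
    # Attach to the buffer created by the parent and view this
    # worker's slice of it as a float32 NumPy array
    shm = shared_memory.SharedMemory(name=shm_name)
    out = np.ndarray((size,), dtype=np.float32,
                     buffer=shm.buf, offset=start * 4)
    with h5py.File(file_name, "r") as f:
        f[dataset_name].read_direct(out, np.s_[start:start + size])
    del out      # release the view before closing the mapping
    shm.close()

if __name__ == "__main__":
    n = 100_000_000      # illustrative dataset length
    num_processes = 8    # illustrative
    base = n // num_processes
    chunks = [(i * base,
               base + n % num_processes if i == num_processes - 1 else base)
              for i in range(num_processes)]
    shm = shared_memory.SharedMemory(create=True, size=n * 4)  # 4 B/float32
    try:
        with Pool(num_processes) as p:
            p.map(read_chunk_into_shared,
                  [("data.h5", "dset", shm.name, start, size)
                   for start, size in chunks])
        result = np.ndarray((n,), dtype=np.float32, buffer=shm.buf)
        print(result[:10])
        del result   # release the view before closing the shared memory
    finally:
        shm.close()
        shm.unlink()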

For efficiency, you might also want to look into whether the HDF5 library you're using supports parallel I/O natively (Parallel HDF5, which is built on MPI). If it does, this could be more efficient than manually dividing the dataset into chunks and reading each chunk in a separate thread or process.
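
A minimal sketch of what that looks like with h5py, assuming h5py was built against a parallel (MPI) HDF5 and mpi4py is installed; file and dataset names are illustrative, and you would launch it with something like mpiexec -n 64 python read_mpi.py:

from mpi4py import MPI
import h5py
import numpy as np

comm = MPI.COMM_WORLD
with h5py.File("data.h5", "r", driver="mpio", comm=comm) as f:
    dset = f["dset"]
    n = dset.shape[0]
    base = n // comm.size
    start = comm.rank * base
    # the last rank picks up the remainder
    size = n - start if comm.rank == comm.size - 1 else base
    local = np.empty(size, dtype=dset.dtype)
    dset.read_direct(local, np.s_[start:start + size])  # each rank reads its slice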