Load hdf files in parallel from dask dataframe


I have a dask dataframe that looks as follows:

   digitizer_get_current_savefile        motor_get_position       motor_goto_strip  nevts  biasvoltage  z-height  x-dist
0  /home/data/tct-waveforms/waveform...  [-76399, -20270, -1283]  1.0               15     -200.0       -1283     -76399
1  /home/data/tct-waveforms/waveform...  [-76404, -20270, -1283]  2.0               15     -200.0       -1283     -76404
2  /home/data/tct-waveforms/waveform...  [-76409, -20270, -1283]  3.0               15     -200.0       -1283     -76409
3  /home/data/tct-waveforms/waveform...  [-76414, -20270, -1283]  4.0               15     -200.0       -1283     -76414
4  /home/data/tct-waveforms/waveform...  [-76419, -20270, -1283]  5.0               15     -200.0       -1283     -76419

I want to use dask's single-machine parallelism for the next step: loading, in parallel, the HDF5 data files located at the paths in the digitizer_get_current_savefile column.

For that, I have written this code:

import dask.dataframe as dd

channel = "CH0"
def extract_signal(row):
    # Read the hdf5 data file
    df_data = dd.read_hdf(row["digitizer_get_current_savefile"], key=channel)

    # Drop all columns in the datafile that begin with "Time" (only keeping the amplitudes)
    df_data = df_data.loc[:, ~df_data.columns.str.startswith("Time")]

    return df_data.mean().max()

df['signal'] = df.apply(extract_signal, axis=1)

This does not work. Error:

OSError: File(s) not found: a

Somehow, the file paths are not recognized. I'm a beginner with dask, so please excuse me if I made a stupid mistake.

A pandas-based version of this code works fine.
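One possible direction, offered as a sketch rather than a confirmed fix: dd.read_hdf builds a lazy dask dataframe and is not really meant to be called once per row inside apply. Also, when no meta argument is given, dask tries to infer the output type by running the function on placeholder data, which is likely where the bogus path "a" in the error comes from. The version below reads each file with plain pandas inside the function and leaves the parallelism to the outer dask dataframe. The helper name max_mean_amplitude and the float64 output dtype are assumptions made for illustration; the key "CH0" and the column name are taken from the question.

```python
import pandas as pd

CHANNEL = "CH0"  # HDF5 key, as in the question


def max_mean_amplitude(df_data: pd.DataFrame) -> float:
    """Drop columns whose name starts with "Time" and return the largest column mean."""
    amplitudes = df_data.loc[:, ~df_data.columns.str.startswith("Time")]
    return float(amplitudes.mean().max())


def extract_signal(row: pd.Series, channel: str = CHANNEL) -> float:
    """Load one waveform file with plain pandas and reduce it to a single number."""
    # pd.read_hdf is eager and needs the optional PyTables dependency.
    df_data = pd.read_hdf(row["digitizer_get_current_savefile"], key=channel)
    return max_mean_amplitude(df_data)


# With the dask dataframe `df` from the question (not executed here):
# passing `meta` tells dask the output name and dtype up front, so it does not
# need to call extract_signal on placeholder rows to guess them.
#
# df["signal"] = df.apply(extract_signal, axis=1, meta=("signal", "f8"))
# result = df.compute()
```

Each partition of df then calls extract_signal on its own rows, so how much actually runs in parallel depends on the number of partitions and on the scheduler in use.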
