Reading multiple .csv.gz files into a dask dataframe

I have multiple .csv.gz files that I'm trying to read into a dask dataframe. I was able to achieve this using the following code:

import glob
import pandas as pd
import dask.dataframe as dd
from dask import delayed

file_paths = glob.glob(file_pattern)

@delayed
def read_csv(file_paths):
    return dd.read_csv(file_paths, compression='gzip', blocksize=None, dtype=None)

# one delayed pandas read per file, gathered into a dask dataframe
dfs = [delayed(pd.read_csv)(fn) for fn in file_paths]
df = dd.from_delayed(dfs)

The problem is that when I try to convert the dask dataframe into a pandas dataframe with df = df.compute(), I get the error:

EmptyDataError: No columns to parse from file

I would really appreciate any help with this.
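
For what it's worth, pandas raises EmptyDataError when an input decompresses to nothing, so one of the matched files is probably empty. A minimal diagnostic sketch (reusing file_pattern from above) might look like:

import glob
import gzip
import os

# flag files that are empty on disk or that decompress to nothing --
# either one makes pandas raise "No columns to parse from file"
for fn in glob.glob(file_pattern):
    if os.path.getsize(fn) == 0:
        print(fn, "is zero bytes on disk")
    else:
        with gzip.open(fn, "rt") as f:
            if not f.readline().strip():
                print(fn, "decompresses to an empty file")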

2 Answers

Answer by Pawan Tolani:

The following worked for me:

import os
import pandas as pd
import dask.dataframe as dd

file_path = r"C:\Users\John Doe\Downloads\checking gz"

dfs = []
for file in os.listdir(file_path):
    if file.endswith('.gz'):
        # one lazy dask frame per gzipped CSV
        df = dd.read_csv(os.path.join(file_path, file),
                         compression='gzip', blocksize=None,
                         error_bad_lines=False)
        dfs.append(df)

new_df = dd.concat(dfs)   # still lazy
pd_df = new_df.compute()  # materialize as a pandas DataFrame
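
Note that error_bad_lines was deprecated in pandas 1.3 and removed in pandas 2.0; on newer versions the equivalent keyword (forwarded by dask to pandas) is on_bad_lines, roughly:

# pandas >= 1.3: on_bad_lines='skip' replaces error_bad_lines=False
df = dd.read_csv(os.path.join(file_path, file),
                 compression='gzip', blocksize=None,
                 on_bad_lines='skip')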
Answer by mdurant:

Calling a dask high-level API like dask.dataframe inside a delayed function is not a good idea. In fact, your requirement should be a one-liner:

df = dd.read_csv(file_pattern, compression='gzip', blocksize=None)

and dask (or actually fsspec) will evaluate the pattern for you and make one partition per input file.
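
For illustration, a complete version of that round trip (with a hypothetical data/*.csv.gz pattern) might look like:

import dask.dataframe as dd

# fsspec expands the glob; gzip is not splittable, so blocksize=None
# gives exactly one partition per input file
df = dd.read_csv('data/*.csv.gz', compression='gzip', blocksize=None)
pdf = df.compute()  # materialize everything as a single pandas DataFrame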