How to split a large CSV into several files with multiprocessing in Python?

I have a very large CSV file (40 GB), and I want to split it into 10 DataFrames by column and then write each one to its own CSV file (about 4 GB each). To save time, I chose multiprocessing, but it doesn't seem to work: the columns are still processed one by one. Is it not possible to write large files with multiprocessing? Here is my code:

import os
import pandas
from multiprocessing import Pool

def split(i, output_path, original_large_data_path):
    data = pandas.read_csv(original_large_data_path)  # read in the whole large csv
    new_data = data[[i]].dropna(how='all', subset=[i])  # keep one column and drop rows where it is NaN
    new_data.to_csv(os.path.join(output_path, '{}.csv'.format(i)))  # write that column to its own csv

pool = Pool(5)
for i in [some columns]:
    r = pool.apply_async(split, (i, output_path, original_large_data_path))
pool.close()
pool.join()
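
As a side note (a suggestion, not part of the original post): apply_async returns an AsyncResult and keeps any exception raised in the worker until .get() is called, so a failing split can make the pool look like it is doing nothing. A minimal sketch of collecting the results and surfacing errors, reusing the placeholder column list from above:

results = []
for i in [some columns]:
    results.append(pool.apply_async(split, (i, output_path, original_large_data_path)))
pool.close()
pool.join()

for r in results:
    r.get()  # re-raises any exception that happened inside the worker process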

Answer from SIGHUP:

Use map, partial and a context manager as follows:

import os
from functools import partial
from multiprocessing import Pool

import pandas

INFILE = ''   # path to the large input csv
OUTPATH = ''  # target directory for the output csv files

def split(infile, outpath, col):
    data = pandas.read_csv(infile)                          # read the large csv
    new_data = data[[col]].dropna(how='all', subset=[col])  # keep one column, drop rows where it is NaN
    new_data.to_csv(os.path.join(outpath, f'{col}.csv'))    # write that column to its own csv

def main():
    with Pool() as pool:
        # partial fixes infile/outpath; map hands each column label to a worker process
        pool.map(partial(split, INFILE, OUTPATH), range(10))  # assumes columns labelled 0-9

if __name__ == '__main__':
    main()
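
A possible refinement, not part of the answer above: each worker re-reads the full 40 GB file only to keep a single column, so passing usecols to pandas.read_csv lets each process load just the column it needs. A sketch under that assumption, keeping the integer column labels used above:

def split(infile, outpath, col):
    # read only the one column this worker is responsible for
    data = pandas.read_csv(infile, usecols=[col])
    new_data = data.dropna(how='all')  # drop rows where the column is NaN
    new_data.to_csv(os.path.join(outpath, f'{col}.csv'))

This keeps each worker's memory footprint at roughly one column's worth of data instead of the whole file.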