I have a dataset df with three columns: 'String_key_val', 'Float_other_val1', 'Int_other_val2'. I want to group by 'String_key_val', then compute the sum of 'Float_other_val1' and of 'Int_other_val2' within each group. Here is my code:
import pandas

df = pandas.read_csv('test.csv')
grouped = df.groupby('String_key_val')
# Per-group sum of each value column
series_calculus1 = grouped['Float_other_val1'].sum()
series_calculus2 = grouped['Int_other_val2'].sum()
# Put the two Series side by side, indexed by the group key
res = pandas.concat([series_calculus1, series_calculus2], axis=1)
res.to_csv('output_test.csv')
My problem is that the input dataset is 10 GB and I only have 4 GB of RAM, so I need to do the computation in chunks, but I can't see how. I thought of using HDFStore, but since I only need to build a numerical dataset, I see no point in storing a complete DataFrame, and I don't think HDFStore can store simple arrays.
What can I do?
I believe a simple approach would be something along these lines....
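Here is a minimal sketch of that idea, not a tuned implementation. It reads the CSV in pieces with the `chunksize` argument of `read_csv`, sums each piece per key, and folds the partial sums into a running total. The helper name `chunked_group_sums` and the default chunk size are my own choices for illustration. It also assumes the number of distinct keys is small enough that the running total fits in memory, which seems likely here since only the sums are kept.

```python
import pandas as pd


def chunked_group_sums(csv_path, key_col, value_cols, chunksize=1_000_000):
    """Sum value_cols per key_col, reading csv_path one chunk at a time.

    Hypothetical helper for illustration. Returns None if the file
    contains no data rows.
    """
    value_cols = list(value_cols)
    total = None
    # With chunksize set, read_csv returns an iterator of DataFrames
    # instead of loading the whole file at once.
    reader = pd.read_csv(
        csv_path,
        usecols=[key_col] + value_cols,  # skip columns we do not need
        chunksize=chunksize,
    )
    for chunk in reader:
        # Partial sums for the keys present in this chunk only
        partial = chunk.groupby(key_col)[value_cols].sum()
        if total is None:
            total = partial
        else:
            # Stack old and new partial sums, then re-aggregate by key.
            # A sum of sums equals the overall sum, so this stays exact.
            total = pd.concat([total, partial]).groupby(level=0).sum()
    return total
```

With the file from the question, this would be called as `chunked_group_sums('test.csv', 'String_key_val', ['Float_other_val1', 'Int_other_val2'])`, and the result can be written out with `.to_csv('output_test.csv')` as before. The trick only works because sums can be combined from partial results; something like a median would need a different approach.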