pandas groupby method slow and inefficient for sum()

Given:

A synthetic dataset as a pandas DataFrame (users vs. tokens), generated via the following helper function:

import numpy as np
import pandas as pd
import string
import random

def get_df(nROWs: int = 10, nCOLs: int = 100, MIN=0.0, MAX=199.0):
    my_strings = string.printable
    # token columns named "tk_<printable char>", filled with random float16 values
    cols = ["tk_" + c for c in random.sample(my_strings, nCOLs)]
    data = np.random.uniform(low=MIN, high=MAX, size=(nROWs, nCOLs)).astype("float16")
    df = pd.DataFrame(data, columns=cols)
    # one random user id per row, drawn from u0 ... u<nROWs>
    df["user_ip"] = [f"u{random.randint(0, nROWs)}" for _ in range(nROWs)]
    return df

Goal:

I would like to sum the values of each column for each group of users.

My inefficient solution:

Consider small DataFrames as follows:

df1 = get_df(nROWs=3, nCOLs=5, MIN=0, MAX=10.0)  # `nCOLs` cannot exceed 100, since `len(string.printable) == 100`
df2 = get_df(nROWs=5, nCOLs=4, MIN=0, MAX=5.0)
df3 = get_df(nROWs=7, nCOLs=9, MIN=0, MAX=1.0)


and concatenate them along axis=0 first:

df_c = pd.concat([df1, df2, df3,], axis=0)

Then the .groupby() method works fine at this small size:

d = dict()
for n, g in df_c.groupby("user_ip"):
    # per-column sum of this user's rows, excluding the user_ip column itself
    d[n] = g.loc[:, g.columns != "user_ip"].sum()
df_res = pd.DataFrame.from_dict(d, orient="index").astype("float16")

Problem:

Assuming I have enough memory on a supercomputer, and considering real DataFrames with sizes on the order of 15e+5 x 45e+3 or larger, this approach is very slow: each iteration of the for loop takes roughly 10 to 15 seconds:

df1 = get_df(nROWs=int(15e+5), nCOLs=100, MIN=0, MAX=200.0)  # `nCOLs` cannot exceed 100, since `len(string.printable) == 100`
df2 = get_df(nROWs=int(3e+6), nCOLs=76, MIN=0, MAX=100.0)
df3 = get_df(nROWs=int(1e+3), nCOLs=88, MIN=0, MAX=0.9)
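
For reference, a rough way to time one pass of the grouped sum on frames of this size (a sketch using time.perf_counter; the exact numbers will of course depend on the machine):

import time

df_c = pd.concat([df1, df2, df3], axis=0)

start = time.perf_counter()
d = dict()
for n, g in df_c.groupby("user_ip"):
    d[n] = g.loc[:, g.columns != "user_ip"].sum()
print(f"grouped-sum loop: {time.perf_counter() - start:.1f} s")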

Is there a better, more efficient way to handle data of this size for this purpose?

Cheers,

1 answer below

Answer by chrslg

You are using for loops. That is a hard "no" for efficiency. Sure, you are iterating only over groups, not over all rows. That would not be a problem if you had only a few groups (a number of groups negligible compared to the average group size).

But that is not your case. You group by user_ip, and the user_ip values are random over a range equal to the number of rows. That means the number of groups is roughly 63.2% of the number of rows (basic statistics: when you draw N values uniformly from N possible values, the expected number of distinct values tends toward (1 - 1/e) ≈ 63.2% of N).

So you could have skipped groupby and pandas .sum altogether and just iterated over all the rows of your DataFrame with naive pure-Python code, incrementing d[user_ip] each time; that would have been only a bit slower (maybe even faster: sure, you have 1/0.632 ≈ 58% more iterations to do, but each one is simpler). Not that I advise this method at all. On the contrary. I mention it only to point out that what you did is about as inefficient as that infamous "just iterate the rows" approach, which is known to be the ultimate inefficient way.
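
Purely for illustration (not something to use, as said above), that naive row-by-row version would look roughly like the sketch below; the names value_cols, user_totals and df_naive are illustrative, not from the question:

from collections import defaultdict

# Illustration only: the "just iterate the rows" approach warned about above.
value_cols = [c for c in df_c.columns if c != "user_ip"]
user_totals = defaultdict(lambda: np.zeros(len(value_cols)))

for user, values in zip(df_c["user_ip"].to_numpy(), df_c[value_cols].to_numpy()):
    # treat NaNs (introduced by concatenating frames with different columns) as 0,
    # matching the default skipna behaviour of pandas .sum()
    user_totals[user] += np.nan_to_num(values)

df_naive = pd.DataFrame.from_dict(user_totals, orient="index", columns=value_cols)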

You must find ways to remove all for loops (there is also one in get_df, which could easily be replaced by np.random.randint(0, nROWs, nROWs), even if that is not your main concern).
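
A minimal sketch of that vectorized replacement inside get_df (note that np.random.randint's upper bound is exclusive while random.randint's is inclusive, so nROWs + 1 keeps the same range of ids):

# vectorized replacement for the per-row list comprehension in get_df
ids = np.random.randint(0, nROWs + 1, size=nROWs)
df["user_ip"] = "u" + pd.Series(ids, index=df.index).astype(str)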

For example (just a quick proposal; you have to check that it works, and maybe adapt it):

# per-user, per-column sums in a single vectorized call (no Python-level loop)
dfg = df_c.groupby('user_ip').sum()
# collapses those per-column sums into one grand total per user, indexed by user_ip
df_res = pd.DataFrame(dfg.values.sum(axis=1), dfg.index)
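
As a quick sanity check (a sketch, reusing df_c and the loop-based df_res from the question), the first groupby line alone should already reproduce the per-user, per-column sums:

dfg = df_c.groupby("user_ip").sum().astype("float16")

# align rows/columns, then compare with a tolerance suited to float16
aligned = df_res.reindex(index=dfg.index, columns=dfg.columns)
print(np.allclose(aligned.to_numpy(), dfg.to_numpy(), rtol=1e-2))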