pandas groupby method slow and inefficient for sum()

Given:

A synthetic dataset as a pandas DataFrame (users vs. tokens), generated via the following helper function:

import numpy as np
import pandas as pd
import string
import random

def get_df(nROWs: int = 10, nCOLs: int = 100, MIN=0.0, MAX=199.0):
    my_strings = string.printable
    # token columns named "tk_<printable char>", filled with random float16 values
    cols = ["tk_" + c for c in random.sample(my_strings, nCOLs)]
    data = np.random.uniform(low=MIN, high=MAX, size=(nROWs, nCOLs)).astype("float16")
    df = pd.DataFrame(data, columns=cols)
    # one random user id per row, drawn from u0 ... u<nROWs>
    df["user_ip"] = [f"u{random.randint(0, nROWs)}" for _ in range(nROWs)]
    return df

Goal:

I would like to sum the values of each column for each group of users.

My inefficient solution:

Consider small DataFrames as follows:

df1 = get_df(nROWs=3, nCOLs=5, MIN=0, MAX=10.0)  # `nCOLs` cannot exceed 100, since `len(string.printable) == 100`
df2 = get_df(nROWs=5, nCOLs=4, MIN=0, MAX=5.0)
df3 = get_df(nROWs=7, nCOLs=9, MIN=0, MAX=1.0)


and concatenate them along axis=0 first:

df_c = pd.concat([df1, df2, df3,], axis=0)

Then the .groupby() method works fine at this small size:

d = dict()
for n, g in df_c.groupby("user_ip"):
    # per-column sum of this user's rows, excluding the user_ip column itself
    d[n] = g.loc[:, g.columns != "user_ip"].sum()
df_res = pd.DataFrame.from_dict(d, orient="index").astype("float16")

Problem:

Assuming I have enough memory on a supercomputer, and considering real DataFrames with sizes on the order of 15e+5 x 45e+3 or larger, this approach is very slow: each iteration of the for loop takes roughly 10 to 15 seconds:

df1 = get_df(nROWs=int(15e+5), nCOLs=100, MIN=0, MAX=200.0)  # `nCOLs` cannot exceed 100, since `len(string.printable) == 100`
df2 = get_df(nROWs=int(3e+6), nCOLs=76, MIN=0, MAX=100.0)
df3 = get_df(nROWs=int(1e+3), nCOLs=88, MIN=0, MAX=0.9)
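
For reference, a rough way to time one pass of the grouped sum on frames of this size (a sketch using time.perf_counter; the exact numbers will of course depend on the machine):

import time

df_c = pd.concat([df1, df2, df3], axis=0)

start = time.perf_counter()
d = dict()
for n, g in df_c.groupby("user_ip"):
    d[n] = g.loc[:, g.columns != "user_ip"].sum()
print(f"grouped-sum loop: {time.perf_counter() - start:.1f} s")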

Is there a better, more efficient way to handle data of this size for this purpose?

Cheers,

1 answer below

Answer by chrslg

You are using for loops. That is a hard "no" for efficiency. Sure, you are iterating only over groups, not over all rows. That would not be a problem if you had only a few groups (a number of groups negligible compared to the average group size).

But that is not your case. You group by user_ip, and the user_ip values are random over a range equal to the number of rows. That means the number of groups is roughly 63.2% of the number of rows (basic statistics: when you draw N values uniformly from N possible values, the expected number of distinct values tends toward (1 - 1/e) ≈ 63.2% of N).

So you could have skipped groupby and pandas .sum altogether and just iterated over all the rows of your DataFrame with naive pure-Python code, incrementing d[user_ip] each time; that would have been only a bit slower (maybe even faster: sure, you have 1/0.632 ≈ 58% more iterations to do, but each one is simpler). Not that I advise this method at all. On the contrary. I mention it only to point out that what you did is about as inefficient as that infamous "just iterate the rows" approach, which is known to be the ultimate inefficient way.
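
Purely for illustration (not something to use, as said above), that naive row-by-row version would look roughly like the sketch below; the names value_cols, user_totals and df_naive are illustrative, not from the question:

from collections import defaultdict

# Illustration only: the "just iterate the rows" approach warned about above.
value_cols = [c for c in df_c.columns if c != "user_ip"]
user_totals = defaultdict(lambda: np.zeros(len(value_cols)))

for user, values in zip(df_c["user_ip"].to_numpy(), df_c[value_cols].to_numpy()):
    # treat NaNs (introduced by concatenating frames with different columns) as 0,
    # matching the default skipna behaviour of pandas .sum()
    user_totals[user] += np.nan_to_num(values)

df_naive = pd.DataFrame.from_dict(user_totals, orient="index", columns=value_cols)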

You must find ways to remove all for loops (there is also one in get_df, which could easily be replaced by np.random.randint(0, nROWs, nROWs), even if that is not your main concern).
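
A minimal sketch of that vectorized replacement inside get_df (note that np.random.randint's upper bound is exclusive while random.randint's is inclusive, so nROWs + 1 keeps the same range of ids):

# vectorized replacement for the per-row list comprehension in get_df
ids = np.random.randint(0, nROWs + 1, size=nROWs)
df["user_ip"] = "u" + pd.Series(ids, index=df.index).astype(str)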

For example (just a quick proposal; you have to check that it works, and maybe adapt it):

# per-user, per-column sums in a single vectorized call (no Python-level loop)
dfg = df_c.groupby('user_ip').sum()
# collapses those per-column sums into one grand total per user, indexed by user_ip
df_res = pd.DataFrame(dfg.values.sum(axis=1), dfg.index)
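
As a quick sanity check (a sketch, reusing df_c and the loop-based df_res from the question), the first groupby line alone should already reproduce the per-user, per-column sums:

dfg = df_c.groupby("user_ip").sum().astype("float16")

# align rows/columns, then compare with a tolerance suited to float16
aligned = df_res.reindex(index=dfg.index, columns=dfg.columns)
print(np.allclose(aligned.to_numpy(), dfg.to_numpy(), rtol=1e-2))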