Given:
A synthetic dataset (users vs. tokens) in pandas DataFrame format, generated via the following helper function:
import numpy as np
import pandas as pd
import string
import random
def get_df(nROWs: int = 10, nCOLs: int = 100, MIN=0.0, MAX=199.0):
    my_strings = string.printable
    df = pd.DataFrame(
        np.random.uniform(low=MIN, high=MAX, size=(nROWs, nCOLs)).astype("float16"),
        columns=["tk_" + s for s in random.sample(my_strings, nCOLs)],
    )
    df["user_ip"] = [f"u{random.randint(0, nROWs)}" for r in range(nROWs)]
    return df
Goal:
I would like to sum up the values of each column for each group of users.
My inefficient solution:
Considering small dataframes as follows:
df1 = get_df(nROWs=3, nCOLs=5, MIN=0, MAX=10.0)  # `nCOLs` cannot exceed 100, since `len(string.printable) == 100`
df2 = get_df(nROWs=5, nCOLs=4, MIN=0, MAX=5.0)
df3 = get_df(nROWs=7, nCOLs=9, MIN=0, MAX=1.0)
and concatenate them along axis=0 first:
df_c = pd.concat([df1, df2, df3,], axis=0)
then the .groupby() method works fine at this small size:
d = dict()
for n, g in df_c.groupby("user_ip"):
    d[n] = g.loc[:, g.columns != "user_ip"].sum()
df_res = pd.DataFrame.from_dict(d, orient="index").astype("float16")
Assuming I have enough memory on a supercomputer, and considering real dataframes with sizes on the order of 15e+5 x 45e+3 or larger, this is extremely slow: each iteration of the for loop takes roughly 10~15 seconds:
df1 = get_df(nROWs=int(15e+5), nCOLs=100, MIN=0, MAX=200.0)  # `nCOLs` cannot exceed 100, since `len(string.printable) == 100`
df2 = get_df(nROWs=int(3e+6), nCOLs=76, MIN=0, MAX=100.0)
df3 = get_df(nROWs=int(1e+3), nCOLs=88, MIN=0, MAX=0.9)
I was wondering whether there is a better, more efficient way to handle data of this size for this purpose.
Cheers,




You are using for loops. That is a hard "no" for efficiency. Sure, you are iterating over groups only, not over all rows, and that would not be a problem if you had only a few groups (a number of groups negligible compared to the average group size). But that is not your case. You group by `user_ip`, and the `user_ip` values are random over a span equal to the number of rows. That means the number of groups is about 63.2% of the number of rows (that is statistics: when you draw N values uniformly at random from N possible values, the number of distinct values tends toward (1 - 1/e) ≈ 63.2% of N).
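If you want to convince yourself of that 63.2% figure, here is a quick NumPy check (illustration only, not part of the solution):

import numpy as np
N = 1_000_000
ids = np.random.randint(0, N, N)   # N random draws from N possible values
print(len(np.unique(ids)) / N)     # prints roughly 0.632, i.e. about 1 - 1/e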
So you could have skipped `groupby` and pandas `.sum` altogether and just iterated over all the rows of your dataframe with naive pure-Python code, incrementing `d[user_ip]` each time; that would have been only a bit slower (maybe even faster: you would have 1/0.632 ≈ 1.58 times as many iterations, i.e. about 58% more, but each one is simpler). Not that I advise this method at all, on the contrary. I mention it only to point out that what you did is about as inefficient as the infamous "just iterate over the rows" approach, which is known to be the ultimate inefficient way.
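For concreteness, such a row-by-row accumulation would look roughly like this (not a recommendation, just a sketch of the anti-pattern, assuming `df_c` as built above):

value_cols = [c for c in df_c.columns if c != "user_ip"]
keys = df_c["user_ip"].to_numpy()
vals = df_c[value_cols].to_numpy(dtype="float64")
d = {}
for i in range(len(df_c)):            # row-by-row accumulation: the anti-pattern
    k = keys[i]
    v = np.nan_to_num(vals[i])        # NaNs appear where a concatenated frame lacked that column
    if k in d:
        d[k] += v
    else:
        d[k] = v.copy()
df_naive = pd.DataFrame.from_dict(d, orient="index", columns=value_cols)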
You must find a way to remove all for loops (there is also one in `get_df`, which could easily be replaced by `np.random.randint(0, nROWs, nROWs)`, even if that is not your main concern). For example (just a quick proposal, you have to check whether it works, and maybe adapt it):
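One way such a loop-free version could look (a sketch only, assuming `df_c` as built above; check and adapt as needed): let pandas do the grouping and the column-wise sums in a single vectorized call.

# Vectorized per-user sums: no Python-level loop, the work happens in pandas/NumPy.
# Assumption: float16 is often not handled by groupby's fast aggregation paths,
# so cast up first and cast back at the end only if float16 is really required.
df_res = (
    df_c
    .set_index("user_ip")              # move the grouping key out of the value columns
    .astype("float32")
    .groupby(level=0, sort=False)      # one group per user_ip
    .sum()                             # column-wise sum within each group (NaNs skipped)
    .astype("float16")
)

# Likewise, the list comprehension inside get_df could be vectorized, e.g. (untested sketch;
# the +1 keeps random.randint's inclusive upper bound):
# df["user_ip"] = np.char.add("u", np.random.randint(0, nROWs + 1, nROWs).astype(str))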