I'm interested to know how one might rewrite the following function, foo, within the functional programming paradigm. I can't figure out how to apply filter_df() to the columns in kwargs and store the output without changing the value of the original DataFrame object, df.
def foo(df, **kwargs):
    for column, values in kwargs.items():
        df = filter_df(df, column, values)
    return df

def filter_df(df, column, values):
    return df.loc[df[column].isin(values)].reset_index(drop=True)
An obvious solution to me might be to assign the output of filter_df to a new variable, df_new, e.g.
def foo(df, **kwargs):
    for column, values in kwargs.items():
        df_new = filter_df(df, column, values)
    return df_new
However, this is not particularly memory efficient, as df could be quite large. Also, I'm not sure if this option would be classed as purely functional, because df_new is affected on each loop iteration.
It's not totally clear what you mean by
and by
Note that your second definition of foo doesn't produce the same output as the first one: it returns only the rows that satisfy the last condition and ignores the remaining conditions.
In each iteration, filter_df produces a new object (the rows of the DataFrame that satisfy df[column].isin(values)), since reset_index doesn't act in place. df_new is not "affected on each loop iteration"; the name df_new is simply rebound to (i.e. made to point to) a new object in each iteration. Because every condition is applied to the original df separately, only the DataFrame resulting from the last one is returned.
Solution
In this particular case, foo can be simplified using DataFrame.query. This way you don't create unnecessary intermediate DataFrames.
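A minimal sketch of what that could look like, not necessarily the only way to write it. It assumes the column names are valid Python identifiers and that the values are plain lists of numbers or strings, so their repr can be embedded directly in the query string. The early return for empty kwargs and the sample data are my own additions for illustration.

```python
import pandas as pd


def foo(df, **kwargs):
    # Nothing to filter on: hand back the input unchanged
    # (query would reject an empty expression).
    if not kwargs:
        return df
    # Build a single expression such as "a in [1, 2, 3] and b in ['x']".
    expr = " and ".join(
        f"{column} in {list(values)!r}" for column, values in kwargs.items()
    )
    # query returns a new DataFrame; the caller's df is never modified.
    return df.query(expr).reset_index(drop=True)


# Illustrative data, not from the question.
df = pd.DataFrame({"a": [1, 2, 3, 4], "b": ["x", "y", "x", "y"]})
result = foo(df, a=[1, 2, 3], b=["x"])
```

All conditions are combined into one expression and applied in a single pass, and no name is rebound inside a loop. The original df still has all of its rows afterwards, because nothing here operates in place.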