I'm new to Polars and need some advice from the experts. I have some working code but I've got to believe theres a faster and/or more elegant way to do this. I've got a large dataframe with columns cik(int), form(string) and period(date) of relevance here. Form can have value either '10-Q' or '10-K'. Each cik will have many rows of the 2 form types with different periods represented. What I want to end up with is, for each cik group, only the most recent 10-Q remains and only the most recent 10 10-Ks remain. Of course if there are less than 10 10-K forms, all should remain. Here's what I'm doing now (it works):
def filter_sub_for_11_rows_per_cik(df_):
df = df_.sort('cik')
# Keep only the last 10-Q
q_filtered_df = df.group_by('cik').map_groups(
lambda g:
g.sort('period', descending=True).filter(pl.col('form').eq('10-Q')).head(1))
# Keep the last up to 10 10-Ks
k_filtered_df = df.group_by('cik').map_groups(
lambda g:
g.sort('period', descending=True)
.filter(pl.col('form').eq('10-K'))
.slice(0, min(10, g.filter(pl.col('form').eq('10-K')).shape[0]))
)
return pl.concat([q_filtered_df, k_filtered_df])
To simplify the example, I consider a dataframe with 3 10-Q and 2 10-K entries for each of two values of cik. I'll filter for the 2 most recent 10-K rows and the most recent 10-Q row for each group defined by cik.
To filter the dataframe for each group defined by cik, we can simply use
pl.DataFrame.filtertogether withpl.Expr.over(to define the groups) as follows.Explanation.
pl.Expr.overto do this filtering separately for each group defined cik and form (to ensure the index is being reset properly for each form).