Filtering inside groups in polars

Question

Asked by MikeP on 31 March 2024 at 20:38

I'm new to Polars and need some advice from the experts. I have some working code, but I've got to believe there's a faster and/or more elegant way to do this. I have a large dataframe in which the relevant columns are cik (int), form (string) and period (date). form is either '10-Q' or '10-K'. Each cik has many rows of both form types, covering different periods. What I want to end up with is, for each cik group, only the most recent 10-Q and only the 10 most recent 10-Ks. Of course, if there are fewer than 10 10-K forms, all of them should remain. Here's what I'm doing now (it works):

import polars as pl

def filter_sub_for_11_rows_per_cik(df_):
    df = df_.sort('cik')
    # Keep only the last 10-Q
    q_filtered_df = df.group_by('cik').map_groups(
        lambda g:
        g.sort('period', descending=True).filter(pl.col('form').eq('10-Q')).head(1))
    # Keep the last up to 10 10-Ks
    k_filtered_df = df.group_by('cik').map_groups(
        lambda g:
        g.sort('period', descending=True)
        .filter(pl.col('form').eq('10-K'))
        .slice(0, min(10, g.filter(pl.col('form').eq('10-K')).shape[0]))
        )
    return pl.concat([q_filtered_df, k_filtered_df])

**Hericks** · Accepted Answer · 2024-03-31T22:08:38.603000

To simplify the example, I consider a dataframe with 2 10-Q and 3 10-K entries for each of two values of cik. I'll filter for the 2 most recent 10-K rows and the most recent 10-Q row for each group defined by cik.

import polars as pl
import datetime

df = pl.DataFrame({
    "cik": [0] * 5 + [1] * 5,
    "form": (["10-Q"] * 2 + ["10-K"] * 3) * 2,
    "period": [datetime.date(2021, 1, 1+day) for day in range(10)],
})

shape: (10, 3)
┌─────┬──────┬────────────┐
│ cik ┆ form ┆ period     │
│ --- ┆ ---  ┆ ---        │
│ i64 ┆ str  ┆ date       │
╞═════╪══════╪════════════╡
│ 0   ┆ 10-Q ┆ 2021-01-01 │
│ 0   ┆ 10-Q ┆ 2021-01-02 │
│ 0   ┆ 10-K ┆ 2021-01-03 │
│ 0   ┆ 10-K ┆ 2021-01-04 │
│ 0   ┆ 10-K ┆ 2021-01-05 │
│ 1   ┆ 10-Q ┆ 2021-01-06 │
│ 1   ┆ 10-Q ┆ 2021-01-07 │
│ 1   ┆ 10-K ┆ 2021-01-08 │
│ 1   ┆ 10-K ┆ 2021-01-09 │
│ 1   ┆ 10-K ┆ 2021-01-10 │
└─────┴──────┴────────────┘

To filter the dataframe for each group defined by cik, we can simply use pl.DataFrame.filter together with pl.Expr.over (to define the groups) as follows.

(
    df
    .sort(by=["cik", "form", "period"], descending=[False, False, True])
    .filter(
        (
            ((pl.col("form") == "10-Q") & (pl.int_range(pl.len()) == 0)) |
            ((pl.col("form") == "10-K") & (pl.int_range(pl.len()) < 2))
        )
        .over("cik", "form")
    )
)

shape: (6, 3)
┌─────┬──────┬────────────┐
│ cik ┆ form ┆ period     │
│ --- ┆ ---  ┆ ---        │
│ i64 ┆ str  ┆ date       │
╞═════╪══════╪════════════╡
│ 0   ┆ 10-K ┆ 2021-01-05 │
│ 0   ┆ 10-K ┆ 2021-01-04 │
│ 0   ┆ 10-Q ┆ 2021-01-02 │
│ 1   ┆ 10-K ┆ 2021-01-10 │
│ 1   ┆ 10-K ┆ 2021-01-09 │
│ 1   ┆ 10-Q ┆ 2021-01-07 │
└─────┴──────┴────────────┘

Explanation.

We sort the dataframe by cik and form, and by period in descending order, so that within each group defined by cik and form the most recent rows come first.
We keep rows where form is 10-K and the row index is less than 2 (0 or 1; with your data, you'd keep rows with row index less than 10), or where form is 10-Q and the row index is 0, i.e. the most recent entry. We use pl.Expr.over to do this filtering separately for each group defined by cik and form (to ensure the index restarts for each form).
