Polars idiomatic way of aggregating n consecutive rows of a data frame


I'm new to Polars, and I ended up writing this code to compute some aggregating expression over segments of n rows:

import polars as pl

df = pl.DataFrame({"a": [1, 1, 3, 8, 62, 535, 4213]})

(
    df.with_columns(index=pl.int_range(pl.len(), dtype=pl.Int32))
    .group_by_dynamic(index_column="index", every="3i")
    .agg(pl.col("a").mean())
)

For the example I set n == 3 on 7 rows, but think of a smallish n of about 100 applied to a multi-column data frame of about 10**6 rows.

Is this the idiomatic way of doing this type of operation? Using group_by_dynamic over an Int32 range seems like overkill to me, so I was wondering whether there is a more direct way to perform the same aggregation.

2 Answers

Hericks (BEST ANSWER)

IMO your solution using group_by_dynamic already follows best practices when it comes to the aggregation.

However, you can simplify the creation of the index column quite a bit by using pl.DataFrame.with_row_index. As the resulting index column is of an unsigned integer type (and group_by_dynamic only accepts a signed integer index column), you'll need to pass an expression that performs the cast, i.e.

(
    df
    .with_row_index()
    .group_by_dynamic(index_column=pl.col("index").cast(pl.Int32), every="3i")
    .agg(pl.col("a").mean())
)
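
If you'd rather pass a plain column name to index_column, an equivalent variant (a sketch, assuming a recent Polars version where with_row_index creates a column named "index") is to perform the cast eagerly in with_columns:

(
    df
    .with_row_index()
    # cast the unsigned row index to a signed integer up front
    .with_columns(pl.col("index").cast(pl.Int32))
    .group_by_dynamic(index_column="index", every="3i")
    .agg(pl.col("a").mean())
)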
Roman Pekar

group_by_dynamic allows one row to be a member of multiple groups, which is not really needed in your case.
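
That capability only matters when the period argument is longer than every; a minimal sketch (my illustration, not part of the original answer) of such overlapping windows:

(
    df.with_row_index()
    .with_columns(pl.col("index").cast(pl.Int32))
    # windows start every 1 row but span 3 rows, so most rows fall into several groups
    .group_by_dynamic(index_column="index", every="1i", period="3i")
    .agg(pl.col("a").mean())
)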

What you need is to assign a "bucket" number to each row and then group by that number. This is similar to the SQL ntile() window function.

To achieve that you can use, for example, pl.int_range(pl.len()) // 3, which evaluates to (0, 0, 0, 1, 1, 1, 2, ...), as the group_by column:

(
    df
    .group_by(pl.int_range(pl.len()) // 3, maintain_order=True)
    .agg(pl.col("a").mean())
)

┌─────────┬────────────┐
│ literal ┆ a          │
│ ---     ┆ ---        │
│ i64     ┆ f64        │
╞═════════╪════════════╡
│ 0       ┆ 1.666667   │
│ 1       ┆ 201.666667 │
│ 2       ┆ 4213.0     │
└─────────┴────────────┘
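
The same idea generalizes directly; here is a sketch with the bucket size pulled out into a variable n and the group key given an explicit name (the alias "bucket" is my own choice, not part of the answer above):

import polars as pl

n = 3  # bucket size; use e.g. 100 for the real data

df = pl.DataFrame({"a": [1, 1, 3, 8, 62, 535, 4213]})

(
    df.group_by(
        (pl.int_range(pl.len()) // n).alias("bucket"),  # 0, 0, 0, 1, 1, 1, 2, ...
        maintain_order=True,
    )
    .agg(pl.col("a").mean())
)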