How can I create a value centred sliding window in polars?


In Python I wrote a generator which returns what I call a 'value-centred' sliding window over the data. For example, given the data:

v = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

calling the generator like this produces the output shown below:

    for index, window in sliding_window_iter(v, 3):
        print("value={} window={}".format(v[index], window))

    value=1 window=[1, 2, 3]
    value=2 window=[1, 2, 3]
    value=3 window=[2, 3, 4]
    value=4 window=[3, 4, 5]
    value=5 window=[4, 5, 6]
    value=6 window=[5, 6, 7]
    value=7 window=[6, 7, 8]
    value=8 window=[7, 8, 9]
    value=9 window=[8, 9, 10]
    value=10 window=[8, 9, 10]

As you can see, it yields a tuple (index, window), where v[index] is the centre value.
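The generator itself is not shown in the question. For reference, a minimal pure-Python sketch that reproduces the output listed above could look like the following. The body is an assumption, not the asker's actual code: it assumes an odd window size and an indexable sequence, and it clamps the window at both ends so that the first and last few values reuse the nearest full window.

```python
def sliding_window_iter(values, size):
    """Yield (index, window) pairs with the window centred on values[index].

    Hypothetical reconstruction: near the edges the window is clamped so
    that it always holds `size` elements (or all of them if there are fewer).
    """
    k = size // 2          # number of neighbours on each side of the centre
    n = len(values)
    for index in range(n):
        # Ideal start is index - k; clamp it into the range [0, n - size].
        start = max(min(index - k, n - size), 0)
        yield index, list(values[start:start + size])


v = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
for index, window in sliding_window_iter(v, 3):
    print("value={} window={}".format(v[index], window))
```

Note that this sketch needs the whole sequence in memory, so it does not address the streaming requirement; it only pins down the intended semantics.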

How might I re-implement this in polars?

The closest thing I have found is df.group_by_dynamic with include_boundaries=True, but that is not the same.

I'd prefer an implementation that is stream-oriented (i.e. one that does not require reading all the data into memory).

There is 1 best solution below

Hericks On

TLDR. For a window of size 2*k + 1, the following can be used (df and k are defined in the example data further down).

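(
    df
    .rolling(
        # row number as the (sorted) integer index to roll over
        index_column=pl.int_range(pl.len()).alias("index"),
        # trailing window of 2*k + 1 rows ("i" = index units)
        period=f"{2*k+1}i"
    )
    .agg(
        pl.col("val")  # collect each window into a list
    )
    .with_columns(
        # rows before k fail the condition and become null
        pl.when(
            pl.int_range(pl.len()) >= k
        ).then(
            pl.col("val").shift(-k)  # move each trailing window up by k rows
        )
        # fill the trailing nulls first, then the leading ones
        .forward_fill().backward_fill()
    )
)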

Explanation

It sounds like the value-centred sliding window is defined only for odd window sizes (such that there is a unique centre). In the following, we therefore consider window sizes of the form window_size = 2*k + 1 for some positive integer k.

Example data.

import polars as pl

k = 1

df = pl.DataFrame({
    "val": list(range(6))
})

shape: (6, 1)
┌─────┐
│ val │
│ --- │
│ i64 │
╞═════╡
│ 0   │
│ 1   │
│ 2   │
│ 3   │
│ 4   │
│ 5   │
└─────┘

Indeed, shifting the result of polars.DataFrame.rolling with period = 2*k + 1 by -k mostly does the correct thing here.

(
    df
    .rolling(
        index_column=pl.int_range(pl.len()).alias("index"),
        period=f"{2*k+1}i"
    )
    .agg(
        pl.col("val")
    )
    .with_columns(
        pl.col("val").shift(-k)
    )
)

shape: (6, 2)
┌───────┬───────────┐
│ index ┆ val       │
│ ---   ┆ ---       │
│ i64   ┆ list[i64] │
╞═══════╪═══════════╡
│ 0     ┆ [0, 1]    │ # wrong
│ 1     ┆ [0, 1, 2] │ # correct
│ 2     ┆ [1, 2, 3] │ # correct
│ 3     ┆ [2, 3, 4] │ # correct
│ 4     ┆ [3, 4, 5] │ # correct
│ 5     ┆ null      │ # wrong
└───────┴───────────┘

Note. I visualise the result by aggregating the window elements into a list, but really any aggregation could be used.

This is correct except for

  • the first k rows, where the window does not contain enough elements, and
  • the last k rows, which are missing after the shift.

Now, the idea is to use a simple pl.when().then() construct to also overwrite the first k rows with None (without an otherwise branch, pl.when yields null wherever the condition is false). This way, both the first and the last k rows are null.

Finally, we can use a forward/backward fill to fill the missing rows with the desired values.

(
    df
    .rolling(
        index_column=pl.int_range(pl.len()).alias("index"),
        period=f"{2*k+1}i"
    )
    .agg(
        pl.col("val")
    )
    .with_columns(
        pl.when(
            pl.int_range(pl.len()) >= k
        ).then(
            pl.col("val").shift(-k)
        )
        .forward_fill().backward_fill()
    )
)

Note. The initial shift from before was moved inside the pl.when().then() construct. It would also have been possible to set the offset parameter of pl.DataFrame.rolling to -k-1, which centres the window directly. However, we would then need to set both the first and the last k rows of the column to None before filling.

shape: (6, 2)
┌───────┬───────────┐
│ index ┆ val       │
│ ---   ┆ ---       │
│ i64   ┆ list[i64] │
╞═══════╪═══════════╡
│ 0     ┆ [0, 1, 2] │
│ 1     ┆ [0, 1, 2] │
│ 2     ┆ [1, 2, 3] │
│ 3     ┆ [2, 3, 4] │
│ 4     ┆ [3, 4, 5] │
│ 5     ┆ [3, 4, 5] │
└───────┴───────────┘