Using dynamic cut() breaks for each row of a dataframe

65 Views Asked by At

I am trying to bin values to prepare data to be later fed into a plotting library.

For this I am trying to use polars Expr.cut. The dataframe I operate on contains different groups of values, each of these groups should be binned using different breaks. Ideally I would like to use np.linspace(BinMin, BinMax, 50) for the breaks argument of Expr.cut.

I managed to make the BinMin and BinMax columns in the dataframe. But I can't manage to use np.linspace to define the breaks dynamically for each row of the dataframe.

This is a minimal example of what I tried:

import numpy as np
import polars as pl

df = pl.DataFrame({"Value": [12], "BinMin": [0], "BinMax": [100]})

At this point the dataframe looks like:

┌───────┬────────┬────────┐
│ Value ┆ BinMin ┆ BinMax │
│ ---   ┆ ---    ┆ ---    │
│ i64   ┆ i64    ┆ i64    │
╞═══════╪════════╪════════╡
│ 12    ┆ 0      ┆ 100    │
└───────┴────────┴────────┘

And trying to use Expr.cut with dynamic breaks:

df.with_columns(pl.col("Value").cut(breaks=np.linspace(pl.col("BinMin"), pl.col("BinMax"), 50)).alias("Bin"))


---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[10], line 1
----> 1 df.with_columns(pl.col("Value").cut(breaks=range(pl.col("BinMin"), pl.col("BinMax"))).alias("Bin"))

TypeError: 'Expr' object cannot be interpreted as an integer

I understand the error, that np.linspace is expecting to be called with actual scalar integers, not polars Expr. But I cannot figure out how to call it with dynamic breaks derived from the BinMin and BinMax columns.

2

There are 2 best solutions below

0
Hericks On

Unfortunately, pl.Expr.cut doesn't support expressions for the breaks argument (yet), but requires a fixed sequence.

(This would be a good feature request though).

A naive solution that will work for DataFrames, but doesn't use polars' native expression API, would be to use pl.Expr.map_elements together with the corresponding functionality in numpy.

def my_cut(x, num=50):
    seq = np.linspace(x["BinMin"], x["BinMax"], num=num)
    idx = np.digitize(x["Value"], seq)
    return seq[idx-1:idx+1].tolist()

(
    df
    .with_columns(
        pl.struct("Value", "BinMin", "BinMax").map_elements(my_cut).alias("Bin")
    )
)
shape: (1, 4)
┌───────┬────────┬────────┬────────────────────────┐
│ Value ┆ BinMin ┆ BinMax ┆ Bin                    │
│ ---   ┆ ---    ┆ ---    ┆ ---                    │
│ i64   ┆ i64    ┆ i64    ┆ list[f64]              │
╞═══════╪════════╪════════╪════════════════════════╡
│ 12    ┆ 0      ┆ 100    ┆ [10.204082, 12.244898] │
└───────┴────────┴────────┴────────────────────────┘
0
Dean MacGregor On

You can reproduce np.linspace with pl.arange and some math.

Observe that linspace is giving you intervals of (max-min)/(n-1)

Then it's just this

df.with_columns(
    linspace=(
        pl.col("BinMin")+pl.arange(0,(n:=50))*(pl.col("BinMax")-pl.col("BinMin"))/(n-1)
        ).implode()
)

But of course you don't want the bins like that, you want to know which Bin that Value is in. For that, we can use the walrus operator and then sort the bins themselves by the absolute value of their difference the the Value so that the first two are the ones you want and then just take those 2.

Here's the whole thing...

df.with_columns(
    Bin = (
        Bins:=(
            pl.col("BinMin")+
            pl.arange(0,(n:=50))*(pl.col("BinMax")-pl.col("BinMin"))/(n-1)
            )
        )
    .sort_by((pl.col("Value")-Bins).abs())
    .implode()
    .list.gather([0,1])
    .list.reverse()
)

TL/DR start here

But of course that's unwieldy so let's put it in a function and monkey patch it so it's useable.

def exprcut(self, _min, _max, n):
    if isinstance(_min, str):
        _min=pl.col(_min)
    if isinstance(_max, str):
        _max=pl.col(_max)        
    return (
        Bins:=(_min+pl.arange(0,n)*(_max-_min)/(n-1))
        ).sort_by((self-Bins).abs()
                  ).implode().list.gather([0,1]).list.reverse()
pl.Expr.exprcut=exprcut

Now we can just do

df.with_columns(Bin=pl.col("Value").exprcut("BinMin", "BinMax",50))
shape: (1, 4)
┌───────┬────────┬────────┬────────────────────────┐
│ Value ┆ BinMin ┆ BinMax ┆ Bin                    │
│ ---   ┆ ---    ┆ ---    ┆ ---                    │
│ i64   ┆ i64    ┆ i64    ┆ list[f64]              │
╞═══════╪════════╪════════╪════════════════════════╡
│ 12    ┆ 0      ┆ 100    ┆ [10.204082, 12.244898] │
└───────┴────────┴────────┴────────────────────────┘