Polars: efficiently get the 2nd largest element


In polars, how do you efficiently get the 2nd largest element, or nth for some small n compared to the size of the column?


Answered by mozway:

You could use top_k and slice the last row:

import numpy as np
import polars as pl

np.random.seed(0)
# 5 distinct values (0-4) in random order
df = pl.DataFrame({'col': np.random.choice(5, size=5, replace=False)})

# keep the 2 largest rows, then take the last one (the 2nd largest)
out = df.top_k(2, by='col')[-1]

You could also filter on rank, but computing ranks performs a full sort, so it can be algorithmically less efficient:

out = df.filter(pl.col('col').rank(descending=True)==2)

Or with numpy's argpartition (with N = 2 for the 2nd largest):

N = 2
out = df[int(np.argpartition(df['col'], -N)[-N])]

Input:

shape: (5, 1)
┌─────┐
│ col │
│ --- │
│ i64 │
╞═════╡
│ 2   │
│ 0   │
│ 1   │
│ 3   │
│ 4   │
└─────┘

Output:

shape: (1, 1)
┌─────┐
│ col │
│ --- │
│ i64 │
╞═════╡
│ 3   │
└─────┘

Timings:

1M rows:

# top_k
39.3 ms ± 7.23 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# filter+rank
54.6 ms ± 8.58 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

10M rows:

# top_k
427 ms ± 84.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# filter+rank
639 ms ± 102 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

100M rows:

# top_k
4.04 s ± 411 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# filter+rank
6.12 s ± 244 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# argpartition
1.48 s ± 25.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

1B rows:

# top_k
??? (crashed)

# filter+rank
1min 6s ± 559 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# argpartition
7.86 s ± 48.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)