Situation
- I have a large parquet dataset that does not fit into RAM.
- I need to split the dataset into multiple partitions defined by a dictionary such as `{"split1": 0.33, "split2": 0.274, "split3": 0.396}`. The result would be a dictionary of `polars.LazyFrame`s, assuming it is possible to stream the results of the partitioning into `.parquet` files.
- I need this to be as fast as possible.
- I want to use Polars to do this, in lazy evaluation mode, with a streaming `collect()` at the end.
Suboptimal solution
I know I can do a single pass over the data to count the rows, then another pass with `with_row_count()` for each partition, selecting the appropriate row ranges (sketched below).
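For reference, a minimal sketch of this two-pass approach, untested; `data.parquet` and the output file names are placeholders, and newer Polars versions rename `with_row_count()` to `with_row_index()`:

```python
import polars as pl

splits = {"split1": 0.33, "split2": 0.274, "split3": 0.396}
path = "data.parquet"  # placeholder

# Pass 1: count the rows.
n_rows = pl.scan_parquet(path).select(pl.count()).collect().item()

# Passes 2..k: tag rows with an index and filter out contiguous ranges.
lf = pl.scan_parquet(path).with_row_count("row_nr")

parts, offset = {}, 0
names = list(splits)
for i, name in enumerate(names):
    # The last partition absorbs any rounding leftovers.
    length = n_rows - offset if i == len(names) - 1 else int(n_rows * splits[name])
    parts[name] = lf.filter(
        (pl.col("row_nr") >= offset) & (pl.col("row_nr") < offset + length)
    )
    offset += length

# Stream each partition to disk; sink_parquet avoids materialising it in RAM,
# provided the plan is supported by the streaming engine.
for name, part in parts.items():
    part.sink_parquet(f"{name}.parquet")
```

This works, but the counting pass plus one scan per partition is what I would like to avoid.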
Hypothesis
I believe it should be possible to do this "on the fly" in a single pass: iterate over the data and deal rows into the separate baskets at a fixed frequency so that the fractions are matched.
Actual question
What is the most efficient way to do this partitioning? I was not able to find a built-in function in Polars for it, nor was I able to come up with an efficient way to do it myself.
Comments
I am thinking of some mathematical expression based on modular arithmetic (division with remainder), but I can't seem to figure out the details. A rough sketch of the idea follows.
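To make the idea concrete: map the fractions onto contiguous residue bands of a fixed modulus `M`, then send row `i` to the band containing `i % M`, so every window of `M` consecutive rows matches the fractions exactly. A rough, untested sketch, where `M = 1000` is an arbitrary choice fine enough to represent fractions like `0.274`, and `data.parquet` and the output names are again placeholders:

```python
import polars as pl

splits = {"split1": 0.33, "split2": 0.274, "split3": 0.396}
path = "data.parquet"  # placeholder
M = 1000  # resolution; fine enough to express the fractions exactly

# Map each fraction to a contiguous band of residues [lo, hi) tiling 0..M.
bands, lo = {}, 0
for name, frac in splits.items():
    hi = lo + round(frac * M)
    bands[name] = (lo, hi)
    lo = hi
last = list(splits)[-1]
bands[last] = (bands[last][0], M)  # absorb any rounding drift in the last band

# A row index is all that is needed: row i belongs to the band containing i % M.
lf = pl.scan_parquet(path).with_row_count("row_nr")

lazy_splits = {
    name: lf.filter(((pl.col("row_nr") % M) >= lo) & ((pl.col("row_nr") % M) < hi))
    for name, (lo, hi) in bands.items()
}

for name, part in lazy_splits.items():
    part.sink_parquet(f"{name}.parquet")
```

This removes the counting pass and interleaves the rows rather than cutting contiguous blocks, but each `sink_parquet` still triggers its own scan of the file, so I am not sure it qualifies as truly "single pass". Is there a better way?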