Situation
- I have a large parquet dataset that does not fit into RAM.
- I need to split the dataset into multiple partitions defined by a dictionary such as `{"split1": 0.33, "split2": 0.274, "split3": 0.396}`. The result would be a dictionary of `polars.LazyFrame`s, assuming it is possible to stream the results of the partitioning into `.parquet` files.
- I need this to be as fast as possible.
- I want to use Polars to do this, in lazy evaluation mode, with a streaming `collect()` at the end.
Suboptimal solution
I know I can do a single pass over the data to count the rows, then another pass with `with_row_count()` for each partition, selecting the appropriate row ranges (sketched below).
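For reference, a minimal sketch of this two-pass approach, untested; `data.parquet` and the output file names are placeholders, and newer Polars versions rename `with_row_count()` to `with_row_index()`:

```python
import polars as pl

splits = {"split1": 0.33, "split2": 0.274, "split3": 0.396}
path = "data.parquet"  # placeholder

# Pass 1: count the rows.
n_rows = pl.scan_parquet(path).select(pl.count()).collect().item()

# Passes 2..k: tag rows with an index and filter out contiguous ranges.
lf = pl.scan_parquet(path).with_row_count("row_nr")

parts, offset = {}, 0
names = list(splits)
for i, name in enumerate(names):
    # The last partition absorbs any rounding leftovers.
    length = n_rows - offset if i == len(names) - 1 else int(n_rows * splits[name])
    parts[name] = lf.filter(
        (pl.col("row_nr") >= offset) & (pl.col("row_nr") < offset + length)
    )
    offset += length

# Stream each partition to disk; sink_parquet avoids materialising it in RAM,
# provided the plan is supported by the streaming engine.
for name, part in parts.items():
    part.sink_parquet(f"{name}.parquet")
```

This works, but the counting pass plus one scan per partition is what I would like to avoid.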
Hypothesis
I believe it should be possible to do this "on the fly" in a single pass: iterate over the data and deal rows into the separate baskets at a fixed frequency so that the fractions are matched.
Actual question
What is the most efficient way to do this partitioning? I was not able to find a built-in function in Polars for it, nor was I able to come up with an efficient way to do it myself.
Comments
I am thinking of some mathematical expression based on modular arithmetic (division with remainder), but I can't seem to figure out the details. A rough sketch of the idea follows.
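To make the idea concrete: map the fractions onto contiguous residue bands of a fixed modulus `M`, then send row `i` to the band containing `i % M`, so every window of `M` consecutive rows matches the fractions exactly. A rough, untested sketch, where `M = 1000` is an arbitrary choice fine enough to represent fractions like `0.274`, and `data.parquet` and the output names are again placeholders:

```python
import polars as pl

splits = {"split1": 0.33, "split2": 0.274, "split3": 0.396}
path = "data.parquet"  # placeholder
M = 1000  # resolution; fine enough to express the fractions exactly

# Map each fraction to a contiguous band of residues [lo, hi) tiling 0..M.
bands, lo = {}, 0
for name, frac in splits.items():
    hi = lo + round(frac * M)
    bands[name] = (lo, hi)
    lo = hi
last = list(splits)[-1]
bands[last] = (bands[last][0], M)  # absorb any rounding drift in the last band

# A row index is all that is needed: row i belongs to the band containing i % M.
lf = pl.scan_parquet(path).with_row_count("row_nr")

lazy_splits = {
    name: lf.filter(((pl.col("row_nr") % M) >= lo) & ((pl.col("row_nr") % M) < hi))
    for name, (lo, hi) in bands.items()
}

for name, part in lazy_splits.items():
    part.sink_parquet(f"{name}.parquet")
```

This removes the counting pass and interleaves the rows rather than cutting contiguous blocks, but each `sink_parquet` still triggers its own scan of the file, so I am not sure it qualifies as truly "single pass". Is there a better way?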