I'm trying to .explode a column and stream or sink the result to a file, but one of the lists has 300k items (6.7 million characters if combined into a string).
import polars as pl

# one row whose string splits into ~1M words
test = pl.LazyFrame({'col1': 'string ' * 1_000_000})

(test
    .with_columns(explode_me=pl.col('col1').str.split(' '))
    .explode(pl.col('explode_me'))
    .collect(streaming=True)  # the single row explodes past what fits in memory
    .write_parquet('file.parquet')
)
I opened an issue about this, but the response was: "a single row explodes to more than fits into memory. There is not much what we can do with the current architecture. At absolute minimum, the explosion of a single row should fit."
How do I best split the oversized lists into lists with fewer items so my later .explode will fit into memory? (possibly using pl.when())
Basically, split the string every 50k words, so the later .explode works on 6 rows of 50k items instead of 1 row of 300k (which overloads memory).
EDIT: My current solution
import polars as pl

test = pl.LazyFrame({'col1': 'string ' * 1_000_000})

(test
    .with_columns(explode_me=pl.col('col1').str.split(' '))
    .with_columns(
        # re-chunk each big list into a list of 10k-item sub-lists
        # (in newer Polars, .apply has been renamed .map_elements)
        pl.col('explode_me').apply(
            lambda x: [x[i:i + 10_000] for i in range(0, len(x), 10_000)],
            return_dtype=pl.List(pl.List(pl.Utf8)),
        )
    )
    .select(pl.col('explode_me'))
    # exploding the outer level yields many rows of 10k-item lists,
    # each of which fits in memory
    .explode(pl.col('explode_me'))
    .sink_parquet('file.parquet')
)
You can use the list.slice method to chunk the list into smaller lists. The code below takes your example (with only 10 strings) and chunks it into 5 columns, each a list of two strings. The chunk names and expressions are saved in chunk_cols so you can set the chunking logic at run time with whatever logic you want.
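The answer's code isn't reproduced here, but a minimal sketch of that approach might look like the following. It assumes a recent Polars where the list namespace is .list (older versions used .arr); chunk_size, n_chunks, and the chunk_0..chunk_4 names are illustrative choices, not part of the original.

import polars as pl

# the question's example shrunk to 10 words (strip the trailing space so
# the split yields exactly 10 elements)
test = pl.LazyFrame({'col1': ('string ' * 10).strip()})

chunk_size = 2  # 2 words per chunk -> 5 chunks for 10 words
n_chunks = 5

# one named slice expression per chunk; building the dict separately lets
# you decide the chunking logic at run time
chunk_cols = {
    f'chunk_{i}': pl.col('explode_me').list.slice(i * chunk_size, chunk_size)
    for i in range(n_chunks)
}

out = (
    test
    .with_columns(explode_me=pl.col('col1').str.split(' '))
    .with_columns(**chunk_cols)
    .drop('col1', 'explode_me')
    .collect()
)
print(out)

From there, each chunk column holds a short list that can be exploded on its own, or the chunk columns can be melted back into a single column of short lists before one .explode, so no single row blows up memory.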