I'm trying to .explode a column and stream or sink the result to a file, but one of the lists has 300k items (6.7 million characters if combined into a string).
import polars as pl

# one row whose string splits into ~1M words
test = pl.LazyFrame({'col1': 'string ' * 1_000_000})

(test
    .with_columns(explode_me=pl.col('col1').str.split(' '))
    .explode(pl.col('explode_me'))
    .collect(streaming=True)  # the single row explodes past what fits in memory
    .write_parquet('file.parquet')
)
I opened an issue about this, but the response was: "a single row explodes to more than fits into memory. There is not much what we can do with the current architecture. At absolute minimum, the explosion of a single row should fit."
How do I best split the oversized lists into lists with fewer items so my later .explode will fit into memory? (possibly using pl.when())
Basically, split the string every 50k words, so the later .explode works on 6 rows of 50k items instead of 1 row of 300k (which overloads memory).
EDIT: My current solution
import polars as pl

test = pl.LazyFrame({'col1': 'string ' * 1_000_000})

(test
    .with_columns(explode_me=pl.col('col1').str.split(' '))
    .with_columns(
        # re-chunk each big list into a list of 10k-item sub-lists
        # (in newer Polars, .apply has been renamed .map_elements)
        pl.col('explode_me').apply(
            lambda x: [x[i:i + 10_000] for i in range(0, len(x), 10_000)],
            return_dtype=pl.List(pl.List(pl.Utf8)),
        )
    )
    .select(pl.col('explode_me'))
    # exploding the outer level yields many rows of 10k-item lists,
    # each of which fits in memory
    .explode(pl.col('explode_me'))
    .sink_parquet('file.parquet')
)
You can use the list.slice method to chunk the list into smaller lists. The code below takes your example (with only 10 strings) and chunks it into 5 columns, each a list of two strings. The chunk names and expressions are saved in chunk_cols so you can set the chunking logic at run time with whatever logic you want.
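The answer's code isn't reproduced here, but a minimal sketch of that approach might look like the following. It assumes a recent Polars where the list namespace is .list (older versions used .arr); chunk_size, n_chunks, and the chunk_0..chunk_4 names are illustrative choices, not part of the original.

import polars as pl

# the question's example shrunk to 10 words (strip the trailing space so
# the split yields exactly 10 elements)
test = pl.LazyFrame({'col1': ('string ' * 10).strip()})

chunk_size = 2  # 2 words per chunk -> 5 chunks for 10 words
n_chunks = 5

# one named slice expression per chunk; building the dict separately lets
# you decide the chunking logic at run time
chunk_cols = {
    f'chunk_{i}': pl.col('explode_me').list.slice(i * chunk_size, chunk_size)
    for i in range(n_chunks)
}

out = (
    test
    .with_columns(explode_me=pl.col('col1').str.split(' '))
    .with_columns(**chunk_cols)
    .drop('col1', 'explode_me')
    .collect()
)
print(out)

From there, each chunk column holds a short list that can be exploded on its own, or the chunk columns can be melted back into a single column of short lists before one .explode, so no single row blows up memory.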