How do I encapsulate an element into a list without using an `.apply` in Polars?


This is a follow-on from my question about a preparatory .explode for rows that are too big for a regular explode.

How do I appropriately encapsulate a list into another list without using an .apply?

I only want to use the slow .apply on the one or two rows that need it because they are too large for a regular explode. But when I try to use .when/.then, I don't know what to put in place of the None below to resolve the dtype mismatch between the normal value, pl.List(pl.Utf8), and the prepared value, pl.List(pl.List(pl.Utf8)) (needed for the double .explode later).

Ideally, it'd be something like list(pl.col('explode_me'))

This works (with None standing in for the short rows):

import polars as pl

df = pl.LazyFrame({'col1': ['string '*1_000_000, 'a b c']})
(df
 .with_columns(explode_me = pl.col('col1').str.split(' '))
 .with_columns(
     (pl.when(pl.col('explode_me').list.lengths() <= 4)
        # placeholder: the short rows should keep their values here
        .then(None)
        # chunk the long rows into sublists of 10,000 items each
        .otherwise(pl.col('explode_me').apply(lambda x: [x[i:i+10_000] for i in range(0, len(x), 10_000)],
                                              return_dtype=pl.List(pl.List(pl.Utf8))))
        # name the result explicitly so it replaces 'explode_me'
        .alias('explode_me')))
 .explode('explode_me').explode('explode_me')
 .collect(streaming=True)
)
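To make the dtype mismatch concrete, here is a plain-Python sketch (no Polars involved) of what the lambda does to a long row and what "wrapping" would have to do to a short row so both end up at the same nesting depth. The helper names `chunk`, `wrap` and `flatten` are made up for illustration, and the row and chunk sizes are shrunk so it is easy to follow.

```python
from itertools import chain

def chunk(values, size):
    """Split a flat list into consecutive sublists of at most `size` items."""
    return [values[i:i + size] for i in range(0, len(values), size)]

def wrap(values):
    """Put a flat list inside a one-element outer list (same depth as chunk())."""
    return [values]

def flatten(nested):
    """Undo one level of nesting, keeping the original element order."""
    return list(chain.from_iterable(nested))

# Small stand-ins for the two rows in the question.
long_row = ("string " * 25).split(" ")   # trailing space yields a final '' element
short_row = "a b c".split(" ")

nested_long = chunk(long_row, 10)    # list of lists, like List(List(Utf8))
nested_short = wrap(short_row)       # also a list of lists

# Both rows now have the same shape, and flattening restores the originals.
print(len(nested_long), nested_short)
print(flatten(nested_long) == long_row, flatten(nested_short) == short_row)
```

What I'm after is the Polars-native equivalent of `wrap`, so that the short rows don't have to go through .apply.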

I'm sure it's something simple, but I can't for the life of me get it to work. I've tried fiddling with concat, pl.List, implode, cast...

There is 1 answer below

Thomas On

This is my current solution, which may be a bit roundabout because I split the columns, and then coalesce them back together once they're the same type:

import polars as pl

df = pl.LazyFrame({'col1': ['string '*1_000_000, 'a b c']})
(df
 .with_columns(explode_me = pl.col('col1').str.split(' '))
 .with_columns(
     # long rows: chunk into sublists (List(List(Utf8))); short rows become null
     part1 = (pl.when(pl.col('explode_me').list.lengths() > 4)
                .then(pl.col('explode_me').apply(lambda x: [x[i:i+10_000] for i in range(0, len(x), 10_000)],
                                                 return_dtype=pl.List(pl.List(pl.Utf8))))),
     # short rows: keep the flat list (List(Utf8)); long rows become null
     part2 = (pl.when(pl.col('explode_me').list.lengths() <= 4)
                .then(pl.col('explode_me'))),
 )
 # after this explode, part1 is List(Utf8), the same dtype as part2
 .explode('part1')
 .with_columns(pl.coalesce(['part1', 'part2']).alias('combined'))
 .select('combined')
 .explode('combined')
 .collect(streaming=True)
)
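In case the order of operations is hard to follow, here is a rough plain-Python model of the same pipeline: split into two parts, explode the chunked part, coalesce, then flatten. It is only a sketch of the logic, not of how Polars executes it. The function name `split_explode_coalesce` and the small threshold and chunk size are my own choices for illustration. It assumes, as Polars does, that exploding a null list yields a single null row.

```python
from itertools import chain

def split_explode_coalesce(rows, threshold=4, size=10):
    """Model of: part1/part2 split -> explode(part1) -> coalesce(part1, part2)."""
    combined = []
    for row in rows:
        is_long = len(row) > threshold
        # part1: chunked long rows, None (null) for short rows
        part1 = [row[i:i + size] for i in range(0, len(row), size)] if is_long else None
        # part2: short rows as they are, None (null) for long rows
        part2 = None if is_long else row
        # explode('part1'): one output row per chunk; a null stays a single null row
        for piece in (part1 if part1 is not None else [None]):
            # coalesce: take part1 when present, otherwise fall back to part2
            combined.append(piece if piece is not None else part2)
    return combined

rows = [("string " * 25).split(" "), "a b c".split(" ")]
combined = split_explode_coalesce(rows)

# The final explode('combined') corresponds to flattening one level.
flat = list(chain.from_iterable(combined))
print(len(combined), len(flat))
```

Every element of `combined` is a flat list at that point, which is why the coalesce no longer runs into the type mismatch.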