Given an example DataFrame where column 'b' contains lists, each of the same length (so it could also be converted to arrays):
import polars as pl

df_test = pl.DataFrame({'a': [1., 2., 3.], 'b': [[2,2,2], [3,3,3], [4,4,4]]})
df_test
shape: (3, 2)
┌─────┬───────────┐
│ a ┆ b │
│ --- ┆ --- │
│ f64 ┆ list[i64] │
╞═════╪═══════════╡
│ 1.0 ┆ [2, 2, 2] │
│ 2.0 ┆ [3, 3, 3] │
│ 3.0 ┆ [4, 4, 4] │
└─────┴───────────┘
How do I end up with
shape: (3, 3)
┌─────┬───────────┬────────────────────┐
│ a ┆ b ┆ new │
│ --- ┆ --- ┆ --- │
│ f64 ┆ list[i64] ┆ list[f64] │
╞═════╪═══════════╪════════════════════╡
│ 1.0 ┆ [2, 2, 2] ┆ [2.0, 2.0, 2.0] │
│ 2.0 ┆ [3, 3, 3] ┆ [6.0, 6.0, 6.0] │
│ 3.0 ┆ [4, 4, 4] ┆ [12.0, 12.0, 12.0] │
└─────┴───────────┴────────────────────┘
without using map_rows?
The best way I could think of was to use map_rows, which is similar to apply in pandas. According to the docs it is not the most efficient approach, but it works:
df_temp = df_test.map_rows(lambda x: ([x[0] * i for i in x[1]],))
df_temp.columns = ['new']
df_test = df_test.hstack(df_temp)
Edit: adjusted answer to make sure that it works with duplicate values in column 'a'.
Here's one approach:
Data
N.B. Below, column 'a' is changed from [1., 2., 3.] to [1., 1., 3.] to exemplify the need for an extra temporary column 'idx' for the group_by.
Code
Explanation
- Use pl.DataFrame.with_columns (with pl.arange and pl.len) to add a temporary 'idx' column that keeps track of each row, i.e. to differentiate between rows that have the same value in 'a'.
- Use pl.DataFrame.explode to get the list values of 'b' into separate rows.
- Use pl.DataFrame.with_columns to multiply column 'a' by column 'b', assigning the result to 'new'.
- Use pl.DataFrame.group_by on columns 'idx' and 'a', adding maintain_order=True to keep the data in the correct order, and apply GroupBy.agg on columns 'b' and 'new'.
- Finally, drop the temporary 'idx' column (pl.DataFrame.drop).