PySpark Cartesian product of two columns in a dataframe


I have a PySpark DataFrame that contains two columns, each an array of strings. How can I make a new column that is the Cartesian product of them, without splitting them into two dataframes and joining them, and without a UDF?

Example:

Input df:
+------+---------+
|a1    |a2       |
+------+---------+
|[1, 2]|[3, 4, 5]|
|[1, 2]|[7, 8]   |
+------+---------+

Output df:
+------+---------+------------------------------------------------+
|a1    |a2       |a3                                              |
+------+---------+------------------------------------------------+
|[1, 2]|[3, 4, 5]|[{1, 3}, {1, 4}, {1, 5}, {2, 3}, {2, 4}, {2, 5}]|
|[1, 2]|[7, 8]   |[{1, 7}, {1, 8}, {2, 7}, {2, 8}]                |
+------+---------+------------------------------------------------+

There are 2 answers below.

Emma (best answer):

You can try nesting transform calls to create the Cartesian product.

This results in a nested array (one inner array per element of a1), and you can use flatten to collapse it into a single array.

from pyspark.sql import functions as F

df = df.withColumn('a3', F.flatten(F.expr('transform(a1, x -> transform(a2, y -> (x, y)))')))

Result

+------+---------+------------------------------------------------+
|a1    |a2       |a3                                              |
+------+---------+------------------------------------------------+
|[1, 2]|[3, 4, 5]|[{1, 3}, {1, 4}, {1, 5}, {2, 3}, {2, 4}, {2, 5}]|
|[1, 2]|[7, 8]   |[{1, 7}, {1, 8}, {2, 7}, {2, 8}]                |
+------+---------+------------------------------------------------+
user2704177:

You could explode both array columns, add a new column containing the pair of exploded a1 and a2 values, and then collect these pairs back into a list by aggregating again. Make sure you have a column that distinguishes the original rows (e.g. a hash of a1 and a2), so that duplicate a1 values are not aggregated into one group.