Create column by repeating numbers from an array in pyspark

I have a PySpark dataframe with 30 rows and an array of 6 elements. The PySpark version, as reported by MS Fabric, is 3.4.

Let's say the array is [5,4,3,4,1,0]. I need to create a column that repeats these 6 numbers 5 times. That is, a column with elements [5, 4, 3, 4, 1, 0, 5, 4, 3, 4, 1, 0, 5, 4, 3, 4, 1, 0, ...] that is then column-bound to the initial dataframe.

The repeat function does not help because it repeats the full array as separate arrays rather than as one flat sequence. It creates [5,4,3,4,1,0], [5,4,3,4,1,0], ...

How can I create this column?

There is 1 best solution below

Ric S On

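Since PySpark 2.4.0, you can combine array_repeat and flatten to obtain the desired result. array_repeat builds an array containing 5 copies of the original array, and flatten merges those copies into a single array: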

import pyspark.sql.functions as F

df = df.withColumn('array_repeated', F.flatten(F.array_repeat('array', 5)))

Example

df = spark.createDataFrame([
  ([5,4,3,4,1,0], ),
], ['array'])

df = df.withColumn('array_repeated', F.flatten(F.array_repeat('array', 5)))

df.show(truncate=False)
+------------------+------------------------------------------------------------------------------------------+
|array             |array_repeated                                                                            |
+------------------+------------------------------------------------------------------------------------------+
|[5, 4, 3, 4, 1, 0]|[5, 4, 3, 4, 1, 0, 5, 4, 3, 4, 1, 0, 5, 4, 3, 4, 1, 0, 5, 4, 3, 4, 1, 0, 5, 4, 3, 4, 1, 0]|
+------------------+------------------------------------------------------------------------------------------+
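Note that the output above holds the whole repeated sequence as one array in a single row. If what you need is one number per row across the 30 rows, the underlying idea is that row i (counting from 0) takes the element at position i modulo 6. Here is a minimal plain-Python sketch of that indexing only, not a PySpark implementation. The names values, n_rows and value_for_row are made up for illustration, and it assumes your rows have a well-defined order, which a Spark dataframe does not guarantee unless you order it by some column.

```python
# Plain-Python model of the cyclic assignment (illustrative names only).
values = [5, 4, 3, 4, 1, 0]   # the 6-element array from the question
n_rows = 30                   # number of rows in the dataframe


def value_for_row(row_index, values):
    """Return the element that a 0-based row index maps to, cycling through values."""
    return values[row_index % len(values)]


# One value per row: the array is cycled 5 times over the 30 rows.
column = [value_for_row(i, values) for i in range(n_rows)]

print(column[:8])
```

In PySpark the same arithmetic would need a 0-based row index column first (for example derived from row_number over a window ordered by a column of your choice), after which the modulo lookup picks the element for each row.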