PySpark DataFrame format for FPGrowth -> The input column must be array, but got bigint


While trying to get data from an XLSX file into the right format for FPGrowth, I get the following error message when running model = fpGrowth.fit(pivotDF):

IllegalArgumentException: requirement failed: The input column must be array, but got bigint.

I read the data from an XLSX file into a pandas DataFrame, convert it into a Spark DataFrame, and then do some cleaning and pivoting to get the desired table.

pivotDF.printSchema() shows this:

 |-- SalesTransactionID: long (nullable = true)
 |-- 0: long (nullable = true)
 |-- 1: long (nullable = true)
 |-- 2: long (nullable = true)
 |-- 3: long (nullable = true)
 |-- 4: long (nullable = true)
 |-- 5: long (nullable = true)
 |-- 6: long (nullable = true)
.... 

My data (pivotDF) looks like this:

+------------------+---+---+---+---+---+---+---+---+---+---+
|SalesTransactionID|  0|  1|  2|  3|  4|  5|  6|  7|  8|  9|
+------------------+---+---+---+---+---+---+---+---+---+---+
|                 0|  0|  0|  0|  0|  0|  0|  0|  6|  6|  0|
|                 1|  0|  0|  0|  0|  0|  0|  0|  0|  3|  0|
|                 2|  0|  0|  0|  0|  0|  0|  2|  0|  0|  0|
|                 3|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|
|                 4|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|
+------------------+---+---+---+---+---+---+---+---+---+---+

Is there any way to convert or cast this into the required array-type column?

Many thanks in advance.

Edit: The goal I'm aiming for is something like this:

([(0, [7, 8]),
  (1, [8]), 
  (2, [6])], 
["id", "items"])