While trying to get data from an XLSX file into the right format for FPGrowth, I get the following error message when running model = fpGrowth.fit(pivotDF):
IllegalArgumentException: requirement failed: The input column must be array, but got bigint.
I read the data from an XLSX file into a Pandas DataFrame, convert it to a Spark DataFrame, and then do some cleaning and pivoting to get the desired table.
pivotDF.printSchema()
shows this:
|-- SalesTransactionID: long (nullable = true)
|-- 0: long (nullable = true)
|-- 1: long (nullable = true)
|-- 2: long (nullable = true)
|-- 3: long (nullable = true)
|-- 4: long (nullable = true)
|-- 5: long (nullable = true)
|-- 6: long (nullable = true)
....
My data (pivotDF) looks like this:
+------------------+---+---+---+---+---+---+---+---+---+---+
|SalesTransactionID|  0|  1|  2|  3|  4|  5|  6|  7|  8|  9|
+------------------+---+---+---+---+---+---+---+---+---+---+
|                 0|  0|  0|  0|  0|  0|  0|  0|  6|  6|  0|
|                 1|  0|  0|  0|  0|  0|  0|  0|  0|  3|  0|
|                 2|  0|  0|  0|  0|  0|  0|  2|  0|  0|  0|
|                 3|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|
|                 4|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|
+------------------+---+---+---+---+---+---+---+---+---+---+
Is there any way to convert/cast this into the required array-type column?
Many thanks in advance.
Edit: The goal I'm aiming for is something like this:
([(0, [7, 8]),
(1, [8]),
(2, [6])],
["id", "items"])
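One possible way to get from the pivoted table to that shape, as a minimal sketch rather than a definitive solution. It assumes that every column other than SalesTransactionID is named after an integer item id, that a quantity greater than 0 means the item belongs to the transaction, and that Spark 2.4 or later is available (for the SQL filter higher-order function). The helper names nonzero_items and to_items_df are made up for illustration. The first function states the per-row rule in plain Python so it can be checked on a few rows; the second applies the same rule with Spark column expressions.

```python
def nonzero_items(row, item_cols):
    """Per-row rule: keep the ids of the item columns whose quantity is > 0.

    `row` is a mapping from column name to quantity (hypothetical helper,
    useful for checking the expected result on a handful of rows).
    """
    return [int(c) for c in item_cols if row[c] is not None and row[c] > 0]


def to_items_df(pivot_df, id_col="SalesTransactionID"):
    """Collapse the wide pivot table into (id, items) with an array column."""
    from pyspark.sql import functions as F  # imported lazily; needs PySpark

    item_cols = [c for c in pivot_df.columns if c != id_col]

    # One array slot per item column: the item id if quantity > 0, else null.
    # Backticks keep numeric column names such as "0" from being misparsed.
    slots = F.array(
        *[F.when(F.col(f"`{c}`") > 0, F.lit(int(c))) for c in item_cols]
    )

    return (
        pivot_df
        .withColumn("items", slots)
        # Drop the null slots so only the purchased item ids remain.
        .withColumn("items", F.expr("filter(items, x -> x is not null)"))
        .select(F.col(id_col).alias("id"), "items")
    )
```

With a DataFrame built this way, something like fpGrowth.fit(to_items_df(pivotDF)) should no longer hit the "must be array" check, provided fpGrowth was created with itemsCol="items". Transactions without any items (rows 3 and 4 above) end up with an empty array and could be filtered out beforehand if they are not wanted.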