I have a pyspark dataframe that looks like this:
+----+--------------------------+--------------+
|id | score |review |
+----+--------------------------+--------------+
|1 |[83.52, 81.79, 84, 75] |[P,N,P,P] |
|2 |[86.13, 85.48] |[N,N,N,P] |
+----+--------------------------+--------------+
The schema of this dataframe is this:
root
 |-- id: integer (nullable = false)
 |-- score: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- review: array (nullable = true)
 |    |-- element: string (containsNull = true)
For each row, I want the mean of the score array and the mode of the review array, added as new columns. The data types of these new columns should be float and string respectively. Like this:
+----+--------------------------+--------------+---------+----------+
|id | score |review |scoreMean|reviewMode|
+----+--------------------------+--------------+---------+----------+
|1 |[83.52, 81.79, 84, 75] |[P,N,P,P] |81.08 |P |
|2 |[86.13, 85.48] |[N,N,N,P] |85.81 |N |
+----+--------------------------+--------------+---------+----------+
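To make the intended per-row computation concrete, here is a plain-Python sketch of what each new column should hold (the function names are just for illustration):

```python
import statistics

def row_score_mean(scores):
    # the score array holds numbers as strings, so convert before averaging
    if not scores:
        return None
    return statistics.mean(float(s) for s in scores)

def row_review_mode(reviews):
    # statistics.mode returns the most common label in the array
    if not reviews:
        return None
    return statistics.mode(reviews)

print(row_score_mean(["83.52", "81.79", "84", "75"]))  # 81.0775, shown as 81.08
print(row_review_mode(["P", "N", "P", "P"]))           # P
```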
I have tried it using UDFs:
import statistics
import pyspark.sql.functions as F

# define UDFs
def mean_udf(data):
    if len(data) == 0:
        return None
    data_float = [eval(i) for i in data]
    return statistics.mean(data_float)

def mode_udf(data):
    if len(data) == 0:
        return None
    return statistics.mode(data)

# register the UDFs
mean_func = F.udf(mean_udf)
mode_func = F.udf(mode_udf)

# apply the UDFs
df = (df.withColumn("scoreMean", mean_func(F.col("score")))
        .withColumn("reviewMode", mode_func(F.col("review"))))
But it fails with a calculation error during evaluation, i.e. whenever I call show() or collect() on the result.
Any help is welcome. Thanks.
I have also tried a mixture of approaches: computing the mean_score with group by and explode expressions, and a custom udf to get the mode_review, as follows: