I need to hash/categorize a column in a dataframe in pyspark.
df.printSchema()
root
|-- col1: string (nullable = true)
|-- col2: string (nullable = true)
|-- keys: array (nullable = true)
| |-- element: string (containsNull = true)
The dataframe looks like this:
df.show()
+----+----+-----------------------------------------------------------+
|col1|col2| keys |
+----+----+-----------------------------------------------------------+
| A| b|array ["name:ck", "birth:FR", "country:FR", "job:Request"] |
| B| d|array ["name:cl", "birth:DE", "country:FR", "job:Request"] |
| C| d|array ["birth:FR", "name:ck", "country:FR", "job:Request"] |
+----+----+-----------------------------------------------------------+
However, I am getting the following error when trying:
df_hashed_1 = df\
.withColumn('HashedID', sha2(col('keys'), 256))\
.select('col1', 'col2', 'HashedID')
ERROR cannot resolve 'sha2(spark_catalog.default.posintegrationlogkeysevent.keys, 256)' due to data type mismatch: argument 1 requires binary type, however, 'spark_catalog.df.keys' is of array<string> type.;.
How could I hash/categorize this kind of column type?
I tried pyspark.sql.functions.sha2.
sha2 expects a string/binary column, so you can concatenate the elements of the array first:
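For example, something along these lines should work (the '||' separator is an arbitrary choice here; pick any string that won't appear inside the key values):

from pyspark.sql.functions import col, concat_ws, sha2

# concat_ws flattens the array<string> into a single delimited string,
# which sha2 can then hash
df_hashed_1 = (
    df
    .withColumn('HashedID', sha2(concat_ws('||', col('keys')), 256))
    .select('col1', 'col2', 'HashedID')
)

If the order of the elements can differ between rows (as in your sample data, where the third row has the same entries in a different order), you may also want to wrap the column in sort_array(col('keys')) before concatenating, so that equal key sets produce the same hash.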