I need to hash/categorize a column in a dataframe in pyspark.
df.printSchema()
root
|-- col1: string (nullable = true)
|-- col2: string (nullable = true)
|-- keys: array (nullable = true)
| |-- element: string (containsNull = true)
The dataframe looks like this:
df.show()
+----+----+-----------------------------------------------------------+
|col1|col2| keys |
+----+----+-----------------------------------------------------------+
| A| b|array ["name:ck", "birth:FR", "country:FR", "job:Request"] |
| B| d|array ["name:cl", "birth:DE", "country:FR", "job:Request"] |
| C| d|array ["birth:FR", "name:ck", "country:FR", "job:Request"] |
+----+----+-----------------------------------------------------------+
However, I am getting the following error when trying:
df_hashed_1 = df\
.withColumn('HashedID', sha2(col('keys'), 256))\
.select('col1', 'col2', 'HashedID')
ERROR cannot resolve 'sha2(spark_catalog.default.posintegrationlogkeysevent.keys, 256)' due to data type mismatch: argument 1 requires binary type, however, 'spark_catalog.df.keys' is of array<string> type.;.
How could I hash/categorize this kind of column type?
I tried pyspark.sql.functions.sha2.
sha2 expects a string/binary column, so you can concatenate the elements of the array first:
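For example, something along these lines should work (the '||' separator is an arbitrary choice here; pick any string that won't appear inside the key values):

from pyspark.sql.functions import col, concat_ws, sha2

# concat_ws flattens the array<string> into a single delimited string,
# which sha2 can then hash
df_hashed_1 = (
    df
    .withColumn('HashedID', sha2(concat_ws('||', col('keys')), 256))
    .select('col1', 'col2', 'HashedID')
)

If the order of the elements can differ between rows (as in your sample data, where the third row has the same entries in a different order), you may also want to wrap the column in sort_array(col('keys')) before concatenating, so that equal key sets produce the same hash.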