I need to create a column in a Spark dataframe that tracks old values versus new values.
The dataframe has two types of columns: sot (source of truth) columns and normal columns (metrics).
An example value of the resulting column, which compares both types of columns, looks like this:
"{'template_name': {'old_value': '1-en_US-Travel-Guide-Hotels-citymc-Blossom-Desktop-Like-HSR', 'new_value': '1-en_US-HTG_CMC_BEXUS_SecondaryKW_Test_Variant'}, 'template_id': {'old_value': '14949', 'new_value': '37807'}, 'num_questions': {'old_value': 29.0, 'new_value': 28}, 'duplicate_questions': {'old_value': '[]', 'new_value': []}}"
With a plain Python dictionary comprehension, something similar looks like this:
>>> metrics = [1,2,3,4,5,6]
>>> sot = [3,1,6,2,5,1]
>>> str({i: {"old_value":sot[i], "new_value": metrics[i]} for i in range(6) if metrics[i] != sot[i]})
"{0: {'old_value': 3, 'new_value': 1}, 1: {'old_value': 1, 'new_value': 2}, 2: {'old_value': 6, 'new_value': 3}, 3: {'old_value': 2, 'new_value': 4}, 5: {'old_value': 1, 'new_value': 6}}"
But I can't do the same thing with a Spark dataframe:
metrics_cols = extract_metrics_spark_df.columns
temp.withColumn("flagged", str({ i : {"old_value" : f.col("sot_"+i) , "new_value": f.col(i)} for i in metrics_cols if f.col(i) != f.col("sot_"+i) }))
I also couldn't figure out how to use a UDF in this case.
Any help creating this column is appreciated.
I think you could use a UDF for this. Here is a Scala example (PySpark should have similar functionality; I haven't tested the function).
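For the PySpark side, the per-row comparison that such a UDF would wrap can be sketched in plain Python. This is only a sketch under assumptions: `flag_changes`, its `sot_prefix` parameter, and the sample row are hypothetical names that are not in the question, and the PySpark wiring shown in the trailing comments is not run here.

```python
def flag_changes(row, metric_cols, sot_prefix="sot_"):
    """Return str(dict) of the metrics whose value differs from the sot column.

    `row` is any mapping from column name to value (hypothetical helper;
    inside a UDF it could be built with Row.asDict()).
    """
    changes = {
        name: {"old_value": row[sot_prefix + name], "new_value": row[name]}
        for name in metric_cols
        if row[name] != row[sot_prefix + name]
    }
    return str(changes)


# Made-up sample row: one metric changed, one did not.
sample_row = {
    "template_id": "37807",
    "sot_template_id": "14949",
    "num_questions": 28,
    "sot_num_questions": 28,
}
print(flag_changes(sample_row, ["template_id", "num_questions"]))
# prints: {'template_id': {'old_value': '14949', 'new_value': '37807'}}

# Possible PySpark wiring (assumption, untested here):
#   from pyspark.sql import functions as f
#   from pyspark.sql.types import StringType
#   flag_udf = f.udf(lambda r: flag_changes(r.asDict(), metrics_cols), StringType())
#   temp = temp.withColumn("flagged", flag_udf(f.struct(*temp.columns)))
```

The comparison runs once per row inside the function, which is why it can use an ordinary `if`. In the original attempt, `f.col(i) != f.col("sot_"+i)` builds a Column expression instead of a Python boolean, so it cannot be used as the filter of a comprehension. Note that values of different types, such as the string `'[]'` and the list `[]` in the example output above, compare as unequal and would be flagged.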