I have a PySpark DataFrame with 4 columns:
1) Country
2) col1 [numeric]
3) col2 [numeric]
4) col3 [numeric]
I have a UDF that takes a number and formats it to xx.xx (2 decimal places). Using the withColumn function, I can call the UDF and format the numbers.
Example:
df = df.withColumn("col1", num_udf(df.col1))
df = df.withColumn("col2", num_udf(df.col2))
df = df.withColumn("col3", num_udf(df.col3))
What I'm looking for: can we run this UDF on each column in parallel, instead of running it in sequence?
Not sure why you want to run it in parallel, but you can achieve it by using rdd and map:
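Here is a minimal sketch, assuming the formatting logic inside your num_udf is simply "render with two decimals" (replace fmt with whatever your UDF actually does):

# Rebuild every row with all three columns formatted in a single map pass,
# instead of three chained withColumn calls.
def fmt(x):
    return "{:.2f}".format(x) if x is not None else None

df = df.rdd.map(
    lambda row: (row.Country, fmt(row.col1), fmt(row.col2), fmt(row.col3))
).toDF(["Country", "col1", "col2", "col3"])

Note that all three columns are transformed in one pass over the rows here; with chained withColumn calls, Spark's optimizer also collapses the three expressions into a single projection, so the sequential-looking code does not actually mean three separate passes over the data.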