What is the difference between a UDF and a vectorized UDF in Spark 3? I ask because the Spark documentation presents vectorized UDFs as a newer feature.
I know that in Spark 3, a user-defined function (UDF) is a function that you define in a programming language such as Python or Scala and apply to the columns of a Spark DataFrame or Dataset. A UDF takes one or more columns as input and produces a new column as output.
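I experimented with a simple unit test and found that a regular UDF is called on one element at a time, while a vectorized UDF receives a whole batch of values at once (a pandas Series in PySpark), not an array or struct column.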
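A vectorized UDF (called a pandas UDF in PySpark) is designed to improve the performance of UDFs by processing multiple rows at once instead of one row at a time. It is not strictly new in Spark 3: pandas UDFs were introduced in Spark 2.3, and Spark 3.0 redesigned the API around Python type hints.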
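A vectorized UDF takes one or more columns as input and produces a new column as output, just like a regular UDF. However, instead of being called once per row, it is called once per batch of rows. In PySpark each batch arrives as a pandas Series, transferred through Apache Arrow, and the function returns a Series of the same length. Because the per-call overhead is paid once per batch and the body can use vectorized pandas or NumPy operations, this can result in significant performance improvements.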
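To make the "once per row" versus "once per batch" distinction concrete, here is a toy model in plain Python. It is only a sketch of the calling convention, not how Spark is implemented: the helper names (`apply_row_at_a_time`, `apply_in_batches`) are hypothetical and the batch size of 4 is arbitrary. In Spark the batch would be a pandas Series, and the batch size is governed by `spark.sql.execution.arrow.maxRecordsPerBatch`.

```python
from typing import Callable, List, Tuple


def apply_row_at_a_time(values: List[int], fn: Callable[[int], int]) -> Tuple[List[int], int]:
    """Call fn once per element, the way a regular UDF is invoked."""
    out, calls = [], 0
    for v in values:
        out.append(fn(v))
        calls += 1
    return out, calls


def apply_in_batches(
    values: List[int], fn: Callable[[List[int]], List[int]], batch_size: int
) -> Tuple[List[int], int]:
    """Call fn once per slice of batch_size elements, the way a vectorized UDF is invoked."""
    out, calls = [], 0
    for start in range(0, len(values), batch_size):
        out.extend(fn(values[start:start + batch_size]))
        calls += 1
    return out, calls


def add_one(x: int) -> int:
    # Row-style function: one value in, one value out.
    return x + 1


def add_one_to_batch(xs: List[int]) -> List[int]:
    # Batch-style function: many values in, the same number of values out.
    return [x + 1 for x in xs]


data = list(range(10))
row_result, row_calls = apply_row_at_a_time(data, add_one)
batch_result, batch_calls = apply_in_batches(data, add_one_to_batch, batch_size=4)

print(row_result == batch_result)  # same values either way
print(row_calls, batch_calls)      # number of function invocations for each style
```

Both styles compute the same column; what changes is how many times the Python function is entered, which is where much of the overhead of a regular Python UDF comes from.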
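Note: vectorized UDFs are not limited to arrays or structs; they work on ordinary scalar columns as well. The real constraints are different: in PySpark they require pandas and PyArrow to be installed, and a Series-to-Series UDF must return exactly as many values as it receives.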
Below is the full example:
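A minimal PySpark sketch of such a comparison could look like the following. The function and column names (`plus_one_row`, `plus_one_batch`, `compare_udfs`, `regular`, `vectorized`) are my own illustrative choices, and the code assumes `pyspark`, `pandas`, and `pyarrow` are installed. The Spark imports sit inside `compare_udfs` only so that the two plain functions can be tried without a Spark installation.

```python
import pandas as pd


def plus_one_row(x: int) -> int:
    """Row-at-a-time logic: receives one Python value per call."""
    return x + 1


def plus_one_batch(s: pd.Series) -> pd.Series:
    """Batch logic: receives a whole pandas Series per call."""
    return s + 1


def compare_udfs(spark):
    """Apply both UDF styles to the same column (needs pyspark and pyarrow)."""
    # Imported lazily so the plain functions above stay usable without Spark.
    from pyspark.sql.functions import col, pandas_udf, udf
    from pyspark.sql.types import LongType

    # Regular UDF: Spark calls plus_one_row once for every row.
    regular_udf = udf(plus_one_row, LongType())
    # Vectorized UDF: the pd.Series -> pd.Series type hints tell Spark 3
    # to call plus_one_batch once per Arrow batch.
    vectorized_udf = pandas_udf(plus_one_batch, LongType())

    df = spark.range(5)  # one column named "id" holding 0..4
    return (
        df.withColumn("regular", regular_udf(col("id")))
          .withColumn("vectorized", vectorized_udf(col("id")))
    )
```

With a local session, for example `spark = SparkSession.builder.master("local[1]").getOrCreate()`, calling `compare_udfs(spark).show()` should display identical `id + 1` values in the `regular` and `vectorized` columns. The results match; only the way Spark invokes the Python function differs.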
Output: regular UDF
Vectorized UDF:
Conclusion:
The key difference between a UDF and a vectorized UDF in Spark 3 is the unit of work. A regular UDF is called once per row, while a vectorized UDF is called once per batch of rows, which reduces per-row overhead and usually makes it noticeably faster for Python workloads.