Why does MutableAggregationBuffer in UserDefinedAggregateFunction require a bufferSchema?

232 Views Asked by At

I am looking into implementing a UserDefinedAggregateFunction in spark and see that a bufferSchema is needed. I understand how to create it, but my issue is why does it require a bufferSchema? Should it not only need a size (number of elements for use in aggregation), an inputSchema and a dataType? Doesn't a bufferSchema constrain it to UserDefinedTypes in the intermediate steps in sql?

1

There are 1 best solutions below

1
Raphael Roth On

This is needed because the buffer schema can differ from the input type. For example if you want to calculate the average (arithmetic mean) of doubles, the buffer needs a count and a sum in this case See e.g. the example from databricks how to calculate the geometric mean : https://docs.databricks.com/spark/latest/spark-sql/udaf-scala.html