I have a requirement to find the median for multiple columns of double datatype. Please suggest the correct approach.
Below is my sample dataset with one column. I am expecting the median value to be returned as 1 for my sample.
scala> sqlContext.sql("select num from test").show();
+---+
|num|
+---+
|0.0|
|0.0|
|1.0|
|1.0|
|1.0|
|1.0|
+---+
I tried the following options:
1) Hive UDAF percentile: it worked only for BigInt.
2) Hive UDAF percentile_approx: it does not work as expected (returns 0.25 instead of 1).
sqlContext.sql("select percentile_approx(num,0.5) from test").show();
+----+
| _c0|
+----+
|0.25|
+----+
3) Spark window function percent_rank: the way I see it, to find the median I would look at all rows with percent_rank above 0.5 and pick the num value corresponding to the max percent_rank. But this does not work in all cases, especially when I have an even record count; in that case the median is the average of the two middle values in the sorted distribution.
Also, with percent_rank, since I have to find the median for multiple columns, I have to calculate it in separate dataframes, which seems to me a rather complex method. Please correct me if my understanding is not right.
+---+------------+
|num|percent_rank|
+---+------------+
|0.0|         0.0|
|0.0|         0.0|
|1.0|         0.4|
|1.0|         0.4|
|1.0|         0.4|
|1.0|         0.4|
+---+------------+
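For reference, the percent_rank values in the table are consistent with the usual definition (rank - 1) / (n - 1), where tied rows share the lowest rank. The plain-Python sketch below reproduces them for the sample; the function name `percent_rank` and the list `nums` are illustrative only and no Spark is involved.

```python
def percent_rank(values):
    """Return (value, percent_rank) pairs using (rank - 1) / (n - 1).

    Tied values share the rank of their first occurrence in sorted order.
    """
    ordered = sorted(values)
    n = len(ordered)
    if n == 1:
        return [(ordered[0], 0.0)]
    first_position = {}
    for position, value in enumerate(ordered):
        # position of the first tied row equals rank - 1
        first_position.setdefault(value, position)
    return [(value, first_position[value] / (n - 1)) for value in ordered]


nums = [0.0, 0.0, 1.0, 1.0, 1.0, 1.0]  # the sample column
ranks = percent_rank(nums)
for value, pr in ranks:
    print(value, pr)

# Rows whose percent_rank exceeds 0.5: none in this sample.
above_half = [value for value, pr in ranks if pr > 0.5]
print(above_half)
```

In this sample no row has a percent_rank above 0.5, which is one way to see that a threshold rule on percent_rank is fragile as a median finder.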
Out of curiosity, which version of Apache Spark are you using? There were some fixes in Apache Spark 2.0+ that included changes to approxQuantile.
If I were to run the PySpark code snippet below:
with the median calculation using approxQuantile as:
or
the results are:
Note that these are approximate numbers (via approxQuantile), though in general this should work well. If you need the exact median, one approach is to use numpy.median. The code snippet below is updated for this df example based on gench's SO response to "How to find the median in Apache Spark with Python Dataframe API?":
with the output of:
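Separately from the snippets referenced above, here is a minimal sketch of the exact-median idea for several columns at once. It assumes the column values have already been collected to the driver as plain Python lists, and it uses the standard library's `statistics.median` as a stand-in for `numpy.median`. The dictionary `columns` and the column names `other` and `even` are made up for illustration; only `num` comes from the sample.

```python
from statistics import median

# Hypothetical column data, assumed to be already collected from the
# DataFrame into plain Python lists. Only "num" matches the sample.
columns = {
    "num": [0.0, 0.0, 1.0, 1.0, 1.0, 1.0],
    "other": [3.0, 1.0, 2.0],
    "even": [4.0, 1.0, 3.0, 2.0],
}

# statistics.median sorts the data and, for an even number of values,
# averages the two middle values.
medians = {name: median(values) for name, values in columns.items()}
print(medians)  # {'num': 1.0, 'other': 2.0, 'even': 2.5}
```

This only suits columns small enough to fit in driver memory; for large data the approximate route is likely the more practical one.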
Updated: Spark 1.6 Scala version using RDDs
If you are using Spark 1.6, you can calculate the median using Scala code via Eugene Zhulenev's response to "How can I calculate the exact median with Apache Spark". Below is the modified code that works with our example:
with the output of:
Note that this calculates the exact median using RDDs, i.e. you will need to convert the DataFrame column into an RDD to perform this calculation.
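A common way to get an exact median from a distributed collection is to sort it, attach an index to each element, and look up the middle position (or the two middle positions when the count is even). The sketch below mirrors that idea in plain Python on a local list. It is not the Scala code from the referenced answer, and the function name `exact_median` is hypothetical.

```python
def exact_median(values):
    """Exact median by sorting, indexing, and looking up the middle position(s)."""
    if not values:
        raise ValueError("median of an empty collection is undefined")
    # Pair each sorted value with its index, similar in spirit to sorting
    # an RDD and zipping it with an index.
    by_index = {idx: v for idx, v in enumerate(sorted(values))}
    count = len(by_index)
    if count % 2 == 0:
        # Even count: average the two middle values.
        left = count // 2 - 1
        right = left + 1
        return (by_index[left] + by_index[right]) / 2
    # Odd count: the single middle value.
    return float(by_index[count // 2])


print(exact_median([0.0, 0.0, 1.0, 1.0, 1.0, 1.0]))  # 1.0
```

For the sample column the two middle values are both 1.0, so the result matches the expected median of 1.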