Whenever I want to get distributions for my entire dataset in pandas, I just run the following basic code:
x.groupby('y').describe(percentiles=[.1, .25, .5, .75, .9, 1])
where I get the distribution values for every custom percentile I want. I want to do the exact same thing in pyspark. However, from what I have read, the describe function in pyspark does not allow you to specify percentiles, and the summary function in pyspark only allows the standard values of 0.25, 0.50, and 0.75, so I can't customize it to the percentiles I would like.
How do I do the equivalent of the pandas code above but in pyspark?
You can use percentile_approx on all column names you need (note that we drop the column we are performing the groupby on):
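A minimal sketch of this approach, assuming a DataFrame named sdf with a grouping column 'y' and numeric value columns (both names are placeholders); pyspark.sql.functions.percentile_approx requires Spark 3.1 or later:

    import pyspark.sql.functions as F

    percentiles = [0.1, 0.25, 0.5, 0.75, 0.9, 1.0]

    # Build one percentile_approx aggregation per column,
    # skipping the column we group by.
    aggs = [
        F.percentile_approx(c, percentiles).alias(f"{c}_percentiles")
        for c in sdf.columns
        if c != "y"
    ]

    sdf.groupBy("y").agg(*aggs).show(truncate=False)

Each output column is an array holding the approximate percentile values in the same order as the requested percentages.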
For anyone using an earlier version of PySpark, you can calculate percentiles using F.expr (credit goes to this answer by @Ala Tarighati). Using a random sample PySpark dataframe:
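A minimal sketch of the F.expr variant, assuming a small example DataFrame built here for illustration (the column names and values are made up). The SQL percentile_approx expression has been available well before the Python wrapper, so this works on older PySpark versions:

    import pyspark.sql.functions as F

    # Hypothetical sample data: grouping column 'y' and a numeric column 'value'.
    sdf = spark.createDataFrame(
        [("a", 1.0), ("a", 2.0), ("a", 3.0), ("b", 10.0), ("b", 20.0)],
        ["y", "value"],
    )

    percentiles = "array(0.1, 0.25, 0.5, 0.75, 0.9, 1.0)"

    # percentile_approx is invoked as a SQL expression instead of a Python function.
    result = sdf.groupBy("y").agg(
        F.expr(f"percentile_approx(value, {percentiles})").alias("value_percentiles")
    )
    result.show(truncate=False)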