We have to compute percentiles for 100 columns in an data frame. In the example below, the column names that need percentiles are pctile_columns. The criteria for receiving percentiles is (1) the column is not NA, and (2) the min_pg column is >= 12. We are struggling to obtain the correct set of percentiles:
Data + Attempt
temp_df = structure(list(group_var = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1),
min_pg = c(11, 15, 19, 7, 5, 34, 32, 27, 24, 18, 13, 10),
stat1 = c(0.35, 0.32, 0.27, NA, NA, 0.42, 0.45, 0.47, 0.33, NA, 0.24, 0.39)),
row.names = c(NA, -12L), class = "data.frame")
library(dplyr)
pctile_columns <- c('stat1')
temp_output <- temp_df %>%
group_by(group_var) %>%
mutate(across(.cols = all_of(pctile_columns),
.fns = ~ if_else(is.na(.) | min_pg < 12, as.numeric(NA),
rank(., ties.method = "max")),
.names = "{.col}__rank")) %>%
mutate(across(.cols = all_of(pctile_columns),
.fns = ~ if_else(is.na(.) | min_pg < 12, as.numeric(NA),
round((rank(., ties.method = "max") - 1) / (n() - 1) * 100, 0)),
.names = "{.col}__pctile"))
Output
# Groups: group_var [1]
group_var min_pg stat1 stat1__rank stat1__pctile
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 11 0.35 NA NA
2 1 15 0.32 3 18
3 1 19 0.27 2 9
4 1 7 NA NA NA
5 1 5 NA NA NA
6 1 34 0.42 7 55
7 1 32 0.45 8 64
8 1 27 0.47 9 73
9 1 24 0.33 4 27
10 1 18 NA NA NA
11 1 13 0.24 1 0
12 1 10 0.39 NA NA
The problem with this output is that the ranks go from 1-9, whereas they should go from 1-7. Even though the stat1 values with min_pg < 12 are correctly being assigned an NA value, these stat1 values are still being factored into the rank equation when computing the ranks for all of the other rows. The correct set of ranks should be 1-7 in this instance, as there are 7 metrics that meet the criteria for stat1 to receive a rank/percentile.
How can we revise our code to compute ranks/percentiles properly per our criteria?
You could write a
statfunand use it inby.Data: