Function to cut multiple columns into roughly equal ordered groups in R

134 Views Asked by At

I am trying to create a function that will add columns that order multiple columns and allocate data into four groups with equal numbers of observations (these will constitute four risk levels). I understand that where there are multiple zeroes this may not be possible, but would like this to work otherwise.

I need to be able to do this with mutate_at as I have 200 different score columns

Sample dataset below. I have used the cut function, the cut function produces groups of unequal sizes even where there are not multiple 0 values

library(tidyverse)
dat <- data.frame(Suburb = c("Werribee", "Hoppers", "Carlton", "North", "Serrano", "Upwell","Market", "Poynton", "Stawell"),
                    scores_1 = c(0,0.0001, 230, 340, 340, 10, 3490, 5,21),
                    scores_2 = c(12, 0,0,0,0,0,0, 45,13),
                    scores_3 = c(4,5,6,7,8,9,10, 45, 45))

allscores <- c("scores_1", "scores_2", "scores_3")

myresult <- dat %>% 
  mutate_at(.vars = allscores, 
          .funs = list(risk_levels = ~ as.integer(cut(as.numeric(.), breaks = 4))))

I have also tried to use cut_numberfunction but have gotten the following error:

risk.data2 <- dat %>% 
  mutate_at(.vars = allscores, 
            .funs = list(risk_levels = ~ as.integer(cut_number(as.numeric(.), n = 4))))
Error in `mutate_cols()`:
! Problem with `mutate()` column `scores_2_risk_levels`.
i `scores_2_risk_levels = (structure(function (..., .x = ..1, .y = ..2, . = ..1) ...`.
x Insufficient data values to produce 4 bins.
Caused by error in `cut_number()`:
! Insufficient data values to produce 4 bins.

How can I use tidyverse to order and then group my many score values into four roughly equal size groups?

2

There are 2 best solutions below

0
Darren Tsai On BEST ANSWER

You can convert each column into ranks before cut().

dat %>% 
  mutate(across(starts_with("scores"),
                ~ cut(dense_rank(.x), 4, labels = FALSE),
                .names = "{sub('scores', 'risk_levels', .col)}"))

#     Suburb scores_1 scores_2 scores_3 risk_levels_1 risk_levels_2 risk_levels_3
# 1 Werribee 0.00e+00       12        4             1             2             1
# 2  Hoppers 1.00e-04        0        5             1             1             1
# 3  Carlton 2.30e+02        0        6             3             1             2
# 4    North 3.40e+02        0        7             4             1             2
# 5  Serrano 3.40e+02        0        8             4             1             3
# 6   Upwell 1.00e+01        0        9             2             1             3
# 7   Market 3.49e+03        0       10             4             1             4
# 8  Poynton 5.00e+00       45       45             2             4             4
# 9  Stawell 2.10e+01       13       45             3             3             4
0
Rui Barradas On

Which one do you want?

suppressPackageStartupMessages(
  library(tidyverse)
)
class <- data.frame(Suburb = c("Werribee", "Hoppers", "Carlton", "North", "Serrano", "Upwell","Market", "Poynton", "Stawell"),
                    scores_1 = c(0,0.0001, 230, 340, 340, 10, 3490, 5,21),
                    scores_2 = c(12, 0,0,0,0,0,0, 45,13),
                    scores_3 = c(4,5,6,7,8,9,10, 45, 45))


f <- function(x) {
  cut(x, breaks = 4L, labels = FALSE)
}
g <- function(x) {
  qq <- quantile(x)
  findInterval(x, qq, rightmost.closed = TRUE, all.inside = TRUE)
}

class %>%
  mutate(across(starts_with("scores"), list(f), 
                .names = "risk_levels_{sub('scores_', '', col)}"))
#>     Suburb scores_1 scores_2 scores_3 risk_levels_1 risk_levels_2 risk_levels_3
#> 1 Werribee 0.00e+00       12        4             1             2             1
#> 2  Hoppers 1.00e-04        0        5             1             1             1
#> 3  Carlton 2.30e+02        0        6             1             1             1
#> 4    North 3.40e+02        0        7             1             1             1
#> 5  Serrano 3.40e+02        0        8             1             1             1
#> 6   Upwell 1.00e+01        0        9             1             1             1
#> 7   Market 3.49e+03        0       10             4             1             1
#> 8  Poynton 5.00e+00       45       45             1             4             4
#> 9  Stawell 2.10e+01       13       45             1             2             4

class %>%
  mutate(across(starts_with("scores"), list(g), 
                .names = "risk_levels_{sub('scores_', '', col)}"))
#>     Suburb scores_1 scores_2 scores_3 risk_levels_1 risk_levels_2 risk_levels_3
#> 1 Werribee 0.00e+00       12        4             1             4             1
#> 2  Hoppers 1.00e-04        0        5             1             3             1
#> 3  Carlton 2.30e+02        0        6             3             3             2
#> 4    North 3.40e+02        0        7             4             3             2
#> 5  Serrano 3.40e+02        0        8             4             3             3
#> 6   Upwell 1.00e+01        0        9             2             3             3
#> 7   Market 3.49e+03        0       10             4             3             4
#> 8  Poynton 5.00e+00       45       45             2             4             4
#> 9  Stawell 2.10e+01       13       45             3             4             4

Created on 2023-08-01 with reprex v2.0.2


Note

The quantile/findInterval may give unexpected results since the quantiles may be equal, see what happens with scores_2:

class %>% pull(scores_2) %>% quantile()
#>   0%  25%  50%  75% 100% 
#>    0    0    0   12   45

Created on 2023-08-01 with reprex v2.0.2

In this case the breaks are not evenly spaced and all scores_2 will fall in the last two intervals. You have two classes only, [0, 12) and [12, 45]. This is probably not what you want.