Create an indicator variable in one data frame based on values in another data frame

222 Views Asked by At

Say, I have a dataset called iris. I want to create an indicator variable called sepal_length_group in this dataset. The values of this indicator will be p25, p50, p75, and p100. For example, I want sepal_length_group to be equal to "p25" for an observation if the Species is "setosa" and if the Sepal.Length is equal to or less than the 25th percentile for all species classified as "setosa". I wrote the following codes, but it generates all NAs:

library(skimr)

sepal_length_distribution <- iris %>% group_by(Species) %>% skim(Sepal.Length) %>% select(3, 9:12)

iris_2 <- iris %>% mutate(sepal_length_group = ifelse(Sepal.Length <= sepal_length_distribution[which(sepal_length_distribution$Species == "setosa"),2], "p25", NA))

iris_2 <- iris %>% mutate(sepal_length_group = ifelse(Sepal.Length > sepal_length_distribution[which(sepal_length_distribution$Species == "setosa"),2] &
                                                Sepal.Length <= sepal_length_distribution[which(sepal_length_distribution$Species == "setosa"),3], "p50", NA))

iris_2 <- iris %>% mutate(sepal_length_group = ifelse(Sepal.Length > sepal_length_distribution[which(sepal_length_distribution$Species == "setosa"),3] &
                                                        Sepal.Length <= sepal_length_distribution[which(sepal_length_distribution$Species == "setosa"),4], "p75", NA))

iris_2 <- iris %>% mutate(sepal_length_group = ifelse(Sepal.Length > sepal_length_distribution[which(sepal_length_distribution$Species == "setosa"),4] &
                                                        Sepal.Length < sepal_length_distribution[which(sepal_length_distribution$Species == "setosa"),5], "p100", NA))

Any help will be highly appreciated!

1

There are 1 best solutions below

1
Onyambu On BEST ANSWER

This could be done simply by the use of the function cut as commented by @Camille

library(tidyverse)
iris %>%
  group_by(Species) %>%
  mutate(cat = cut(Sepal.Length, 
                   quantile(Sepal.Length, c(0,.25,.5,.75, 1)),
                   paste0('p', c(25,50, 75, 100)), include.lowest = TRUE))