In my DATA, I want to censor ns that are smaller than 10 and percents <.05 or >.95 (censoring criteria).
After applying the censoring criteria above:
Our goal is to not allow users to use one of the columns (ex.
n) to figure out the other column (ex.percent) and vice versa.
But there is an issue. These percents are based on each pair of Met values (Yes, No).
For example, the percent on the first row (.791) is from n of the first row divided by sum of the ns from the first row (Met=="No) and second row (Met=="Yes").
Because of the relation between the pairs, censoring based on my censoring criteria alone, doesn't ensure that we achieve our goal above (see my unsuccessful trial below).
Is there a way to achieve my Desire_output below?
P.S. You may ask, in the Desired_output, why 65 on row 5 shouldn't be censored? Because percent on rows 5 and 6 are both already censored, so any n > 10 can stay as is, since no one can figure n on row 6 (i.e., n==1) by looking at n at row 5 (i.e., n==65).
library(tidyverse)
library(scales)
DATA <- read.table(h=T,text=
" Ethnic_overall EL_Type Met n percent
AmerInd NeverEL No 704 0.791
AmerInd NeverEL Yes 186 0.209
AmerInd Current No 65 0.985
AmerInd Current Yes 1 0.015
AmerInd Former No 50 0.847
AmerInd Former Yes 9 0.153")
# Doesn't produce Desired_output
mutate(DATA, percent_new =
ifelse(percent < .05, "<5%", ifelse(percent > .95, ">95%", percent(percent))),
n = ifelse(n < 10 | percent %in% c("<5%",">95%"), "*", n) )
# Current unsuccessful output:
Ethnic_overall EL_Type Met n percent_new
1 AmerInd NeverEL No 704 79.1%
2 AmerInd NeverEL Yes 186 20.9%
3 AmerInd Current No 65 >95%
4 AmerInd Current Yes * <5%
5 AmerInd Former No 50 84.7%
6 AmerInd Former Yes * 15.3%
# Desired_output:
Ethnic_overall EL_Type Met n percent_new
1 AmerInd NeverEL No 704 79.1%
2 AmerInd NeverEL Yes 186 20.9%
3 AmerInd Current No 65 >95%
4 AmerInd Current Yes * <5%
5 AmerInd Former No * 84.7%
6 AmerInd Former Yes * 15.3%
I think the issues are 2 fold. One, you have multiple "if" conditions for
ifelse(), which I believe only accepts a binary syllogism. Second, you are trying to combine numeric and character datatypes in a single column, which "dplyr" will throw a fit about. You can overcome #1 by usingcase_when()instead ofifelse(),as the former accepts multiple conditions to test before writing the default value. To overcome the second challenge, simply coerce the default value to a character variable usingas.character().Here is the full solution, which gives your desired output.Here is the output: