Censoring data in a dataframe conditionally in R

45 Views Asked by At

In my DATA, I want to censor ns that are smaller than 10 and percents <.05 or >.95 (censoring criteria).

After applying the censoring criteria above:

Our goal is to not allow users to use one of the columns (ex. n) to figure out the other column (ex. percent) and vice versa.

But there is an issue. These percents are based on each pair of Met values (Yes, No).

For example, the percent on the first row (.791) is from n of the first row divided by sum of the ns from the first row (Met=="No) and second row (Met=="Yes").

Because of the relation between the pairs, censoring based on my censoring criteria alone, doesn't ensure that we achieve our goal above (see my unsuccessful trial below).

Is there a way to achieve my Desire_output below?

P.S. You may ask, in the Desired_output, why 65 on row 5 shouldn't be censored? Because percent on rows 5 and 6 are both already censored, so any n > 10 can stay as is, since no one can figure n on row 6 (i.e., n==1) by looking at n at row 5 (i.e., n==65).

library(tidyverse)
library(scales)

DATA <- read.table(h=T,text=
" Ethnic_overall  EL_Type  Met  n   percent
         AmerInd  NeverEL  No 704   0.791
         AmerInd  NeverEL Yes 186   0.209
         AmerInd  Current  No  65   0.985
         AmerInd  Current Yes   1   0.015
         AmerInd   Former  No  50   0.847
         AmerInd   Former Yes   9   0.153")

# Doesn't produce Desired_output
mutate(DATA, percent_new =
         ifelse(percent < .05, "<5%", ifelse(percent > .95, ">95%", percent(percent))),
       n = ifelse(n < 10 | percent %in% c("<5%",">95%"), "*", n) )

# Current unsuccessful output:
  Ethnic_overall EL_Type  Met   n percent_new
1        AmerInd NeverEL  No 704       79.1%
2        AmerInd NeverEL Yes 186       20.9%
3        AmerInd Current  No  65        >95%
4        AmerInd Current Yes   *         <5%
5        AmerInd  Former  No  50       84.7%
6        AmerInd  Former Yes   *       15.3%

# Desired_output:
  Ethnic_overall EL_Type Met   n percent_new
1        AmerInd NeverEL  No 704       79.1%
2        AmerInd NeverEL Yes 186       20.9%
3        AmerInd Current  No  65        >95%
4        AmerInd Current Yes  *          <5%
5        AmerInd  Former  No  *        84.7%
6        AmerInd  Former Yes  *        15.3%
1

There are 1 best solutions below

4
Sean McKenzie On

I think the issues are 2 fold. One, you have multiple "if" conditions for ifelse(), which I believe only accepts a binary syllogism. Second, you are trying to combine numeric and character datatypes in a single column, which "dplyr" will throw a fit about. You can overcome #1 by using case_when() instead of ifelse(), as the former accepts multiple conditions to test before writing the default value. To overcome the second challenge, simply coerce the default value to a character variable using as.character(). Here is the full solution, which gives your desired output.

library(scales)
library(tidyverse)

DATA <- read.table(h=T,text=
                     " Ethnic_overall  EL_Type  Met  n   percent
         AmerInd  NeverEL  No 704   0.791
         AmerInd  NeverEL Yes 186   0.209
         AmerInd  Current  No  65   0.985
         AmerInd  Current Yes   1   0.015
         AmerInd   Former  No  50   0.847
         AmerInd   Former Yes   9   0.153")

mutate(DATA, percent_new =
         case_when(percent < .05 ~ "<5%", 
                   percent > .95~ ">95%", 
                   TRUE~as.character(percent))) %>% 
      mutate(n = case_when(n < 10~"*",
                     percent_new %in% c("<5%",">95%") ~ "*",
                     TRUE~as.character(n)), test_met=lead(Met), test_n=lead(n)) %>% 
   mutate(n=case_when(test_n=="*" & test_met=="Yes"~"*",
                      TRUE~n)) %>%  select(-c(test_n, test_met))

Here is the output:

  Ethnic_overall EL_Type Met   n percent percent_new
1        AmerInd NeverEL  No 704   0.791       0.791
2        AmerInd NeverEL Yes 186   0.209       0.209
3        AmerInd Current  No   *   0.985        >95%
4        AmerInd Current Yes   *   0.015         <5%
5        AmerInd  Former  No   *   0.847       0.847
6        AmerInd  Former Yes   *   0.153       0.153