Censoring data in a dataframe conditionally in R

Question

Censoring data in a dataframe conditionally in R

45 Views Asked by Simon Harmel At 15 March 2024 at 21:54

In my DATA, I want to censor ns that are smaller than 10 and percents <.05 or >.95 (censoring criteria).

After applying the censoring criteria above:

Our goal is to not allow users to use one of the columns (ex. n) to figure out the other column (ex. percent) and vice versa.

But there is an issue. These percents are based on each pair of Met values (Yes, No).

For example, the percent on the first row (.791) is from n of the first row divided by sum of the ns from the first row (Met=="No) and second row (Met=="Yes").

Because of the relation between the pairs, censoring based on my censoring criteria alone, doesn't ensure that we achieve our goal above (see my unsuccessful trial below).

Is there a way to achieve my Desire_output below?

P.S. You may ask, in the Desired_output, why 65 on row 5 shouldn't be censored? Because percent on rows 5 and 6 are both already censored, so any n > 10 can stay as is, since no one can figure n on row 6 (i.e., n==1) by looking at n at row 5 (i.e., n==65).

library(tidyverse)
library(scales)

DATA <- read.table(h=T,text=
" Ethnic_overall  EL_Type  Met  n   percent
         AmerInd  NeverEL  No 704   0.791
         AmerInd  NeverEL Yes 186   0.209
         AmerInd  Current  No  65   0.985
         AmerInd  Current Yes   1   0.015
         AmerInd   Former  No  50   0.847
         AmerInd   Former Yes   9   0.153")

# Doesn't produce Desired_output
mutate(DATA, percent_new =
         ifelse(percent < .05, "<5%", ifelse(percent > .95, ">95%", percent(percent))),
       n = ifelse(n < 10 | percent %in% c("<5%",">95%"), "*", n) )

# Current unsuccessful output:
  Ethnic_overall EL_Type  Met   n percent_new
1        AmerInd NeverEL  No 704       79.1%
2        AmerInd NeverEL Yes 186       20.9%
3        AmerInd Current  No  65        >95%
4        AmerInd Current Yes   *         <5%
5        AmerInd  Former  No  50       84.7%
6        AmerInd  Former Yes   *       15.3%

# Desired_output:
  Ethnic_overall EL_Type Met   n percent_new
1        AmerInd NeverEL  No 704       79.1%
2        AmerInd NeverEL Yes 186       20.9%
3        AmerInd Current  No  65        >95%
4        AmerInd Current Yes  *          <5%
5        AmerInd  Former  No  *        84.7%
6        AmerInd  Former Yes  *        15.3%

Original Q&A

There are 1 best solutions below

**Sean McKenzie** · Answer 1 · 2024-03-15T22:42:03.103000

I think the issues are 2 fold. One, you have multiple "if" conditions for ifelse(), which I believe only accepts a binary syllogism. Second, you are trying to combine numeric and character datatypes in a single column, which "dplyr" will throw a fit about. You can overcome #1 by using case_when() instead of ifelse(), as the former accepts multiple conditions to test before writing the default value. To overcome the second challenge, simply coerce the default value to a character variable using as.character(). Here is the full solution, which gives your desired output.

library(scales)
library(tidyverse)

DATA <- read.table(h=T,text=
                     " Ethnic_overall  EL_Type  Met  n   percent
         AmerInd  NeverEL  No 704   0.791
         AmerInd  NeverEL Yes 186   0.209
         AmerInd  Current  No  65   0.985
         AmerInd  Current Yes   1   0.015
         AmerInd   Former  No  50   0.847
         AmerInd   Former Yes   9   0.153")

mutate(DATA, percent_new =
         case_when(percent < .05 ~ "<5%", 
                   percent > .95~ ">95%", 
                   TRUE~as.character(percent))) %>% 
      mutate(n = case_when(n < 10~"*",
                     percent_new %in% c("<5%",">95%") ~ "*",
                     TRUE~as.character(n)), test_met=lead(Met), test_n=lead(n)) %>% 
   mutate(n=case_when(test_n=="*" & test_met=="Yes"~"*",
                      TRUE~n)) %>%  select(-c(test_n, test_met))

Here is the output:

  Ethnic_overall EL_Type Met   n percent percent_new
1        AmerInd NeverEL  No 704   0.791       0.791
2        AmerInd NeverEL Yes 186   0.209       0.209
3        AmerInd Current  No   *   0.985        >95%
4        AmerInd Current Yes   *   0.015         <5%
5        AmerInd  Former  No   *   0.847       0.847
6        AmerInd  Former Yes   *   0.153       0.153

Censoring data in a dataframe conditionally in R

There are 1 best solutions below

Related Questions in R

Related Questions in DATAFRAME

Related Questions in IF-STATEMENT

Related Questions in DPLYR

Related Questions in TIDYVERSE

Trending Questions

Popular # Hahtags

Popular Questions