Binning in R with NA Group


I've been using the following function to create even bin variables:

## Even Bins Function
evenbins <- function(x, bin.count = 5, order = TRUE) {
  bin.size <- rep(length(x) %/% bin.count, bin.count)
  bin.size <- bin.size + ifelse(1:bin.count <= length(x) %% bin.count, 1,0)
  bin <- rep(1:bin.count, bin.size)
  if(order) {
    bin <- bin[rank(x, ties.method = "random")]
  }
  return(factor(bin, levels = 1:bin.count, ordered = order))
}

This works well for the numeric values, but it puts the NAs into the final group (here, the 5th bin). Pivoted, the result looks something like this:

Current Output
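
A minimal sketch of why this happens, assuming the NAs end up with the highest ranks, which is what rank() does when na.last = TRUE. The vector x_demo is made up for illustration; na.last = "keep" is shown for contrast, since it leaves the NAs with an NA rank.

```r
# Hypothetical toy vector, not the data from the question
x_demo <- c(10, NA, 5, NA, 7)

# With na.last = TRUE the NAs are ranked after every real value,
# so indexing the bin labels by rank sends them to the last bin
r_last <- rank(x_demo, na.last = TRUE, ties.method = "first")

# With na.last = "keep" the NAs keep an NA rank instead
r_keep <- rank(x_demo, na.last = "keep", ties.method = "first")

print(r_last)
print(r_keep)
```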

I'd like to tweak the function so that NAs are excluded from the initial binning and kept as NA values, so that grouping by the bin column yields this:

Desired output

Thanks in advance for reading and any help!!

Sample code to work with:

## Set up fake dataset: 450 numeric values followed by 50 NAs
df1 <- data.frame(x = 1:450)
df2 <- data.frame(x = rep(NA, 50))
df3 <- rbind(df1, df2)


## Even Bins Function
evenbins <- function(x, bin.count = 5, order = TRUE) {
  bin.size <- rep(length(x) %/% bin.count, bin.count)
  bin.size <- bin.size + ifelse(1:bin.count <= length(x) %% bin.count, 1,0)
  bin <- rep(1:bin.count, bin.size)
  if(order) {
    bin <- bin[rank(x, ties.method = "random")]
  }
  return(factor(bin, levels = 1:bin.count, ordered = order))
}

df3$Bin <- evenbins(df3$x)
df3$isNA <- ifelse(is.na(df3$x), "# NA", "complete")


t1 <- cbind(
  table(df3$Bin)
  ,table(df3$Bin, df3$isNA)
)

There are 2 best solutions below

Gregor Thomas On

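Here's a simple modification: count the NAs, remove them, bin what is left, and then tack the NAs on again at the end. Because the NA bins are appended last, the result lines up with the input only when the NAs are the last elements of x, as they are in the sample data.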

evenbins <- function(x, bin.count = 5, order = TRUE) {
  n_na <- sum(is.na(x))
  x <- na.omit(x)
  bin.size <- rep(length(x) %/% bin.count, bin.count)
  bin.size <- bin.size + ifelse(1:bin.count <= length(x) %% bin.count, 1,0)
  bin <- rep(1:bin.count, bin.size)
  if(order) {
    bin <- bin[rank(x, ties.method = "random")]
  }
  return(factor(c(bin, rep(NA, n_na)), levels = 1:bin.count, ordered = order))
}

df3 <- rbind(df1, df2)
df3$Bin <- evenbins(df3$x)
df3$isNA <- ifelse(is.na(df3$x), "# NA","complete")
cbind(
  table(df3$Bin, useNA = "always")
  ,table(df3$Bin, df3$isNA, useNA = "always")
)
#         # NA complete <NA>
# 1    90    0       90    0
# 2    90    0       90    0
# 3    90    0       90    0
# 4    90    0       90    0
# 5    90    0       90    0
# <NA> 50   50        0    0
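
If the NAs can appear anywhere in x rather than only at the end, one possible variant is to bin the non-missing values and write the labels back by position. This is a sketch, not part of the answer's code: the name evenbins_keep_na is made up here, and the order argument is dropped for brevity (the result is always an ordered factor).

```r
# Hypothetical variant: bins non-NA values evenly, leaves NAs in place
evenbins_keep_na <- function(x, bin.count = 5) {
  ok <- !is.na(x)          # positions holding real values
  n <- sum(ok)             # number of values to spread over the bins

  # Base size per bin, plus one extra for the first (n %% bin.count) bins
  bin.size <- rep(n %/% bin.count, bin.count) +
    (seq_len(bin.count) <= n %% bin.count)
  labels <- rep(seq_len(bin.count), bin.size)

  # Start with all NA, then fill only the non-missing positions
  bin <- rep(NA_integer_, length(x))
  bin[ok] <- labels[rank(x[ok], ties.method = "random")]

  factor(bin, levels = seq_len(bin.count), ordered = TRUE)
}

# Toy input with NAs in the middle
toy <- c(NA, 5, 1, NA, 3, 2, 4, 6)
toy_bins <- evenbins_keep_na(toy, bin.count = 3)
print(toy_bins)

# The sample data from the question: 450 values and 50 NAs
sample_bins <- evenbins_keep_na(c(1:450, rep(NA, 50)))
print(table(sample_bins, useNA = "always"))
```

The toy input has no ties, so ties.method = "random" does not affect its result; with tied values the bin assignment among the ties would vary from run to run, as in the original function.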
IRTFM On

Here's a fairly simple base solution:

as.data.frame(  table( (df3+100) %/% 100, useNA="always")  , make.names = TRUE)
     x Freq
1    1   99
2    2  100
3    3  100
4    4  100
5    5   51
6 <NA>   50

The key trick is to get the NAs counted by adding the useNA argument to table(). The +100 just shifts the values so that the labels begin at 1, as in your example. The make.names argument to as.data.frame moves the row names of the table object over to a label column.
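
The same idea on a plain vector, as a small sketch. The names x_all, width_bin and counts are made up for illustration. Note that integer division by 100 gives fixed-width groups rather than equal-count ones, which is why the first and last groups in the output above hold 99 and 51 values.

```r
# Same values as the sample data: 1..450 followed by 50 NAs
x_all <- c(1:450, rep(NA, 50))

# Fixed-width grouping: 1-99 -> 1, 100-199 -> 2, ..., 400-450 -> 5
width_bin <- (x_all + 100) %/% 100

# useNA = "always" adds an NA row to the counts even when it would be zero
counts <- table(width_bin, useNA = "always")

print(as.data.frame(counts))
```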