R How To Set Up Cut Function

set.seed(1)
DATA = data.frame(X = sample(c(0:100), 1000, replace = TRUE))
DATA$CUT = with(DATA, cut(X, breaks = c(10,20,30,40,50,60,70,80,90), right = FALSE))

I want the groups 0-9, 10-19, 20-29, ..., 80-89, 90+, but no matter how I call cut() I do not get these breaks.

MrFlick On BEST ANSWER

You need to include the extreme bounds, for example:

breaks <- c(0,10,20,30,40,50,60,70,80,90, Inf)
DATA <- transform(DATA, CUT=cut(X, breaks=breaks, right = FALSE))

which results in

table(DATA$CUT)
#   [0,10)  [10,20)  [20,30)  [30,40)  [40,50)  [50,60)  [60,70)  [70,80)  [80,90) [90,Inf) 
#      102       84       96      102       96      102       90       94      122      112 

Because cut() treats its input as continuous values rather than counts, the interval [0,10) on integer data is the same as [0,9], i.e. 0-9.
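To see why the outer bounds matter, here is a minimal sketch on a small hand-picked vector. The vector `x` and the names `inner` and `full` are made up for illustration and are not the question's data. It compares the question's breaks with breaks that also include 0 and Inf.

```r
# Hypothetical boundary values, chosen to sit on the edges of the bins
x <- c(0, 9, 10, 19, 89, 90, 100)

# Breaks as in the question: nothing below 10 or at/above 90 is covered
inner <- cut(x, breaks = seq(10, 90, by = 10), right = FALSE)
is.na(inner)
# 0, 9, 90 and 100 fall outside every interval and become NA

# Adding 0 and Inf as outer bounds covers the whole range
full <- cut(x, breaks = c(seq(0, 90, by = 10), Inf), right = FALSE)
as.character(full)
# 9 lands in [0,10) and 10 in [10,20), so each bin holds 0-9, 10-19, ...
```

With `right = FALSE` each interval is closed on the left and open on the right, which is why 10 starts a new bin rather than ending the previous one.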

If you want to set the labels, you can do

breaks <- c(0,10,20,30,40,50,60,70,80,90, Inf)
labels <- paste(head(breaks, -1), tail(breaks, -1)-1, sep="-")
DATA <- transform(DATA, CUT=cut(X, breaks=breaks, labels=labels, right = FALSE))

which now results in

table(DATA$CUT)
#    0-9  10-19  20-29  30-39  40-49  50-59  60-69  70-79  80-89 90-Inf 
#    102     84     96    102     96    102     90     94    122    112 
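The question asked for a "90+" group rather than "90-Inf". One possible variation on the labels above is sketched here. The helper names `lower` and `upper` and the test vector `x` are assumptions for illustration, not part of the original answer.

```r
breaks <- c(seq(0, 90, by = 10), Inf)
lower  <- head(breaks, -1)       # 0, 10, ..., 90
upper  <- tail(breaks, -1) - 1   # 9, 19, ..., 89, Inf

# Use "lower-upper" for bounded bins and "lower+" for the open-ended last bin
labels <- ifelse(is.finite(upper),
                 paste(lower, upper, sep = "-"),
                 paste0(lower, "+"))

# Hypothetical values to check the labelling at the bin edges
x <- c(0, 9, 10, 89, 90, 100)
grp <- cut(x, breaks = breaks, labels = labels, right = FALSE)
as.character(grp)
```

The same `labels` vector can be passed to the `transform()` call above in place of the pasted one.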
Jinjin On

It is hard to check the processed data using with(), so I would use within() to create a new column bin. Also, instead of hard-coding c(0, 10, ..., Inf), I would define the breaks dynamically from the quotient of X divided by 10, which keeps the code flexible and compatible.

> library(magrittr)  # provides the %>% pipe used below
> within(df, bin <- cut(X, breaks = c(X %/% 10 %>% unique(), max(X) %/% 10 + 1) * 10, right = FALSE))
      X       bin
1    26   [20,30)
2    37   [30,40)
3    57   [50,60)
4    91  [90,100)
5    20   [20,30)
6    90  [90,100)
7    95  [90,100)
8    66   [60,70)
9    63   [60,70)
10    6    [0,10)
11   20   [20,30)
12   17   [10,20)
13   69   [60,70)
14   38   [30,40)
15   77   [70,80)
16   50   [50,60)
17   72   [70,80)
18  100 [100,110)
...
## check for NAs
> within(df, bin <- cut(X, breaks = c(X %/% 10 %>% unique() * 10, (max(X) %/% 10 + 1) * 10), right = FALSE)) %>%
+   is.na() %>% sum()
[1] 0
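The same dynamic-break idea can also be sketched in base R with seq(), which avoids the pipe. This is a variation rather than the answer's own code: it builds every break between the lowest and highest decade, so a decade with no observations still gets its own (empty) bin. The vector `x` and the name `width` are assumptions for illustration.

```r
x <- c(26, 37, 57, 91, 20, 6, 100)   # hypothetical sample values
width <- 10                          # assumed bin width

# Breaks run from the decade containing min(x) to one full bin past max(x)
breaks <- seq(floor(min(x) / width) * width,
              (max(x) %/% width + 1) * width,
              by = width)

bin <- cut(x, breaks = breaks, right = FALSE)
data.frame(X = x, bin = bin)

# No value is left unassigned
sum(is.na(bin))
```

Extending the last break one bin past the maximum is what places a value such as 100 in [100,110) instead of turning it into NA.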
dash2 On

You could use (my) 'santoku' package:


library(santoku)
set.seed(1)
DATA = data.frame(X = sample(c(0:100), 1000, replace = TRUE))
DATA$cut <- chop_width(DATA$X, 10, labels = lbl_discrete())
table(DATA$cut)

   0—9  10—19  20—29  30—39  40—49  50—59  60—69  70—79  80—89 90—100 
   102     84     96    102     96    102     90     94    122    112