How is the 'cut' function setting intervals?


I'm new to R and am trying to set up the components of a frequency table (frequency, cumulative frequency, relative frequency, and cumulative %). I was given a set of 24 numbers (ranging from 64 to 100) to group into categories with a width of 10. Here is where I run into a problem. I want to create 5 categories with 10 numbers in each category (60-69, 70-79, 80-89, 90-99, 100-109). When I sequence from 60 to 109 by 10 in R, it creates categories that span 11 numbers (60-70, 70-80, 80-90, etc.). If I ask R to do the same task but from 59 to 109 by 10, it gives me the correct counts, but the category labels are numerically inaccurate. Do I need to use a different function to get the correct result, or is there a way to make R treat 60-69 as 10 numbers, the same way length(60:69) returns 10?

Given 24 points of data, I entered each number into a vector. I turned my vector into a data frame with 1 column that I assigned the vector to. I tried using the sequence command in R to categorize the data frame of 24 numbers into the categories 60-69 thru 100-109 in increments of 10 using the following code:

interval_table <- table(cut(data_framex$col1, seq(60, 109, 10)))

The output gave me :

(60,70]  (70,80]  (80,90] (90,100] 
      3        9        8        4

When I ask for length(60:70) in R, it returns 11, so I am assuming cut is somehow starting the count at 61 instead of 60, even though I expected the category to include every number from 60 to 70.

If I set the sequence to the following, it gives me the correct counts, but the categories are incorrect.

interval_table <- table(cut(data_framex$col1, seq(59, 109, 10)))

Output:

 (59,69]  (69,79]  (79,89]  (89,99] (99,109] 
       2        7       11        3        1 

See the full code below. Since I am new, I may be thinking about this completely wrong and should be using a different code, but I can't find the answer with my search results. I appreciate the help!

x <- c(66, 80, 89, 71, 80, 88, 82, 98, 83, 100, 72, 70, 64, 75, 79, 82, 88, 71, 85, 94, 93, 80, 77, 83)
data_framex <- data.frame(col1 = x)
interval_table <- table(cut(data_framex$col1, seq(60, 109, 10)))
interval_table

Output:
 (60,70]  (70,80]  (80,90] (90,100] 
       3        9        8        4

Desired Output:
 (60,69]  (70,79]  (80,89]  (90,99] (100,109] 
       2        7       11        3        1 

There are 2 best solutions below

Afaq Shahid Khan On

You can set the right argument of cut to FALSE, which makes the intervals left-closed: [a, b) includes a and excludes b. The breaks also need to run to 110 rather than 109, because seq(60, 109, 10) stops at 100, so a value of 100 would fall outside the last interval [90,100) and be dropped as NA. Here's how you can modify your code:

# Your data
x <- c(66, 80, 89, 71, 80, 88, 82, 98, 83, 100, 72, 70, 64, 75, 79, 82, 88, 71, 85, 94, 93, 80, 77, 83)

# Create a data frame
data_framex <- data.frame(col1 = x)

# Count values in left-closed intervals [60,70), [70,80), ..., [100,110)
interval_table <- table(cut(data_framex$col1, seq(60, 110, 10), right = FALSE))

# Print the result
print(interval_table)
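
To see how the right argument moves values between intervals, it can help to probe a few boundary values directly. The following is a minimal sketch: the probe values are chosen only for illustration, and the breaks run to 110 here so that 100 gets an interval of its own.

```r
# Boundary values picked for illustration (not the full data set)
probe  <- c(60, 69, 70, 100)
breaks <- seq(60, 110, 10)  # 60, 70, ..., 110

# Default (right = TRUE): intervals are (a, b], so 60 matches no interval
# and 70 is counted in (60,70]
right_closed <- as.character(cut(probe, breaks))

# right = FALSE: intervals are [a, b), so 60 is counted in [60,70)
# and 70 is counted in [70,80)
left_closed <- as.character(cut(probe, breaks, right = FALSE))

# Side-by-side comparison of where each probe value lands
data.frame(value = probe, right_closed, left_closed)
```

With the default setting, 60 comes back as NA because the lowest break is excluded unless include.lowest = TRUE is also passed.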
G. Grothendieck On

Note that the desired output names are not correct, because the notation (a, b] excludes the lower bound and includes the upper bound. That is, (a, b] means a < x <= b, so a is never included. See standard interval notation for more information. To fix that, use [a,b] as shown below.

tab <- table(x %/% 10)
names(tab) <- sprintf("[%s0,%s9]", names(tab), names(tab))
tab

##   [60,69]   [70,79]   [80,89]   [90,99] [100,109] 
##         2         7        11         3         1 

Alternatively, just drop the brackets/parentheses:

names(tab) <- sprintf("%s0-%s9", names(tab), names(tab))
tab

##   60-69   70-79   80-89   90-99 100-109 
##       2       7      11       3       1 
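
If you would rather stay with cut, the same labels can be attached through its labels argument. This is only a sketch, assuming breaks at 60, 70, ..., 110 with right = FALSE; for whole-number data that grouping is equivalent to 60-69, ..., 100-109. The names breaks, lower, labs and tab2 are made up for this example.

```r
x <- c(66, 80, 89, 71, 80, 88, 82, 98, 83, 100, 72, 70, 64, 75, 79, 82,
       88, 71, 85, 94, 93, 80, 77, 83)

breaks <- seq(60, 110, 10)       # 60, 70, ..., 110
lower  <- head(breaks, -1)       # lower edge of each bin: 60, ..., 100

# Build the labels "60-69", "70-79", ..., "100-109"
labs <- paste0(lower, "-", lower + 9)

# Left-closed bins [a, b) with the custom labels attached
tab2 <- table(cut(x, breaks, right = FALSE, labels = labs))
tab2
```

Unlike the integer-division approach, cut keeps a bin in the table even when its count is zero, which may or may not be what you want.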

Note

x <- c(66, 80, 89, 71, 80, 88, 82, 98, 83, 100, 72, 70, 64, 75, 79, 82, 
       88, 71, 85, 94, 93, 80, 77, 83)
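
The question also asks for cumulative frequency, relative frequency, and cumulative %. One possible way to assemble those columns from the counts is sketched below; the column names and the freq_table object are hypothetical, chosen only for this example.

```r
x <- c(66, 80, 89, 71, 80, 88, 82, 98, 83, 100, 72, 70, 64, 75, 79, 82,
       88, 71, 85, 94, 93, 80, 77, 83)

# Counts per decade, labelled "60-69", ..., "100-109"
freq <- table(x %/% 10)
names(freq) <- sprintf("%s0-%s9", names(freq), names(freq))

counts <- as.vector(freq)

# One row per interval with the four requested measures
freq_table <- data.frame(
  interval    = names(freq),
  frequency   = counts,
  cum_freq    = cumsum(counts),
  rel_freq    = counts / length(x),
  cum_percent = 100 * cumsum(counts) / length(x)
)
freq_table
```

Because table(x %/% 10) only lists decades that actually occur in the data, an empty interval would be missing from this table rather than shown with a count of zero.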