How to trim X% top and bottom of a data frame (based on a column)?

Question

75 Views Asked by Joe the Second At 07 October 2023 at 01:54

I have the following data frame:

library(dplyr)  # provides %>% and arrange()

set.seed(3994)
val <- round(runif(n=30, min = 5, max= 300), digits=0)
cat <- rep(c("A", "B", "C"), each= 10)
date <- as.Date(sample(seq(as.Date('2000/01/01'), as.Date('2020/01/01'), by="day"), 30))

df <- data.frame(val, cat, date)
df <- df %>%  
  arrange(cat, val)

I want to trim the top X% and bottom X% of my data within each category of column cat, with the rows sorted by the val column. For example, I want to remove the top 2% and bottom 2% of rows for each of the categories "A", "B", and "C".

I wrote the following code:

trimTopBottomByCategory <- function(dataframe, category_col, numeric_col, date_column,  x) {
  trimmed_dataframes <- list()
  
  categories <- unique(dataframe[[category_col]])
  for (category in categories) {
    subset_df <- dataframe[dataframe[[category_col]] == category, ]
    
    n <- nrow(subset_df)
    num_to_trim <- ceiling(x / 100 * n)
    
    sorted_subset <- subset_df[order(subset_df[[numeric_col]]), ]
    trimmed_df <- sorted_subset[(num_to_trim + 1):(n - num_to_trim), ]
    trimmed_dataframes[[category]] <- trimmed_df
  }

  trimmed_combined <- do.call(rbind, trimmed_dataframes)
  return(trimmed_combined <- trimmed_combined %>% 
             arrange(category_col, date_column))
}

My question: I believe my code does what it is supposed to, but is there an existing method in R that does the same?

Bonus question: I don't understand why my final data is not sorted by the date column.

There are 2 best solutions below

asd-tm On 07 October 2023 at 06:27

Here is a dplyr option:

library(dplyr)

df %>%
  group_by(cat) %>%
  # Percentile position of each row within its group (0 to 100).
  # This relies on df already being sorted by val within cat; to remove rows
  # by value rather than by row order, use rank() instead of row_number().
  mutate(proc = (row_number() - 1) / (n() - 1) * 100) %>%
  filter(between(proc, 2, 98)) %>%
  ungroup() %>%
  select(-proc) %>%
  arrange(date)
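
The comment in the code above mentions rank() as a way to trim by value instead of by row position. A minimal sketch of that variant follows, assuming dplyr is installed. The toy data frame and the ties.method choice are illustrative assumptions, not part of the original answer.

```r
library(dplyr)

# Toy data (hypothetical): deliberately NOT sorted by val within each group
toy <- data.frame(
  cat = rep(c("A", "B"), each = 5),
  val = c(50, 10, 40, 20, 30, 5, 4, 3, 2, 1)
)

trimmed <- toy %>%
  group_by(cat) %>%
  # rank() gives each value's position within its group, whatever the row order;
  # ties.method = "first" is an assumption that keeps ranks unique
  mutate(proc = (rank(val, ties.method = "first") - 1) / (n() - 1) * 100) %>%
  filter(between(proc, 2, 98)) %>%
  ungroup() %>%
  select(-proc)
```

With five rows per group, proc takes the values 0, 25, 50, 75 and 100, so only the smallest and largest val in each group fall outside the 2 to 98 window.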

jay.sf On 07 October 2023 at 06:04 (accepted answer)

Order by cat and date rather than by cat and val. (This should also work with dplyr::arrange, but I don't want to load dplyr.)

df <- df[with(df, order(cat, date)), ]
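
On the bonus question, a likely explanation (offered as a reading of the posted code, not a confirmed diagnosis): inside the function, category_col and date_column hold column names as strings, so arrange(category_col, date_column) sorts by two constant strings rather than by the columns, which leaves the row order unchanged. A minimal base R sketch of sorting by columns whose names are stored in character variables is shown here; sort_by_names and the toy data are hypothetical names used only for illustration.

```r
# Hypothetical helper (not from the original post): sort a data frame by
# columns whose names are passed as character strings.
sort_by_names <- function(dataframe, category_col, date_column) {
  # [[ ]] looks the column up by its name, so order() sees the actual values
  dataframe[order(dataframe[[category_col]], dataframe[[date_column]]), ]
}

toy <- data.frame(
  cat  = c("B", "A", "A"),
  date = as.Date(c("2001-01-01", "2005-01-01", "2003-01-01"))
)

sorted <- sort_by_names(toy, "cat", "date")
```

If you prefer to stay in dplyr, arrange(.data[[category_col]], .data[[date_column]]) should have the same effect.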

You can use ave, whose first argument is the value val and whose second is the category cat. ave applies FUN to the values within each category. To drop the highest and lowest 2%, we use quantile to compute the two cutoffs and then compare each value against them. The comparison is logical, but because val is numeric, ave returns numeric values, so we wrap the result in as.logical to get the logical vector ss, with which we subset the data frame.

ss <- with(df, as.logical(ave(val, cat, FUN=\(x) {
  q <- quantile(x, probs=c(.02, 1 - .02))
  x >= q[1] & x <= q[2]
})))

df[ss, ]
#    val cat       date
# 3   81   A 2000-08-10
# 10 188   A 2000-11-03
# 4  171   A 2006-11-26
# 2  182   A 2009-07-05
# 7  173   A 2010-09-12
# 6   54   A 2012-06-01
# 1  227   A 2014-08-05
# 9   95   A 2016-09-13
# 17 219   B 2002-12-29
# 14 221   B 2004-07-28
# 18 225   B 2011-06-29
# 19 191   B 2013-03-05
# 16 236   B 2013-09-27
# 12 117   B 2015-11-30
# 15 131   B 2017-11-22
# 13  92   B 2019-02-09
# 27 251   C 2000-03-13
# 30 160   C 2001-03-12
# 28 112   C 2002-02-19
# 29 174   C 2005-07-19
# 22 248   C 2006-12-23
# 21 176   C 2012-01-25
# 26  85   C 2016-08-06
# 24  56   C 2017-12-12
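
To reuse this with a different percentage or different columns, the same ave/quantile idea can be wrapped in a function. This is a sketch under stated assumptions: trim_by_quantile and its parameter names are hypothetical, and x is taken to be a percentage between 0 and 50.

```r
# Hypothetical wrapper (not from the original answer): keep rows whose value
# lies between the x% and (100 - x)% quantiles of its own category.
trim_by_quantile <- function(dataframe, category_col, numeric_col, x) {
  p <- x / 100
  keep <- as.logical(ave(
    dataframe[[numeric_col]], dataframe[[category_col]],
    FUN = function(v) {
      q <- quantile(v, probs = c(p, 1 - p))
      v >= q[1] & v <= q[2]
    }
  ))
  dataframe[keep, ]
}

# Toy data (hypothetical): two groups of ten values each
toy <- data.frame(g = rep(c("A", "B"), each = 10), v = c(1:10, 11:20))
trimmed <- trim_by_quantile(toy, "g", "v", 10)
```

Note that quantile cutoffs and the question's ceiling(x / 100 * n) row count can trim different numbers of rows, so it is worth checking which definition of "top X%" you actually need.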

Data:

df <- structure(list(val = c(81, 188, 171, 12, 264, 182, 173, 54, 227, 
95, 219, 221, 274, 78, 225, 191, 236, 117, 131, 92, 251, 160, 
112, 265, 174, 248, 176, 42, 85, 56), cat = c("A", "A", "A", 
"A", "A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B", 
"B", "B", "B", "B", "C", "C", "C", "C", "C", "C", "C", "C", "C", 
"C"), date = structure(c(11179, 11264, 13478, 13910, 14119, 14430, 
14864, 15492, 16287, 17057, 12050, 12627, 14565, 14605, 15154, 
15769, 15975, 16769, 17492, 17936, 11029, 11393, 11737, 12467, 
12983, 13505, 15364, 15472, 17019, 17512), class = "Date")), row.names = c(3L, 
10L, 4L, 5L, 8L, 2L, 7L, 6L, 1L, 9L, 17L, 14L, 20L, 11L, 18L, 
19L, 16L, 12L, 15L, 13L, 27L, 30L, 28L, 23L, 29L, 22L, 21L, 25L, 
26L, 24L), class = "data.frame")
