I have the following data frame:
library(dplyr)  # needed for %>% and arrange()

set.seed(3994)
val <- round(runif(n = 30, min = 5, max = 300), digits = 0)
cat <- rep(c("A", "B", "C"), each = 10)
date <- as.Date(sample(seq(as.Date('2000/01/01'), as.Date('2020/01/01'), by = "day"), 30))
df <- data.frame(val, cat, date)
df <- df %>%
  arrange(cat, val)
I want to trim the top X% and bottom X% of my data within each category of the column cat. For example, I want to remove the top 2% and bottom 2% of rows for each of the categories "A", "B", and "C", with the rows ranked by the val column.
I wrote the following code:
trimTopBottomByCategory <- function(dataframe, category_col, numeric_col, date_column, x) {
  trimmed_dataframes <- list()
  categories <- unique(dataframe[[category_col]])
  for (category in categories) {
    subset_df <- dataframe[dataframe[[category_col]] == category, ]
    n <- nrow(subset_df)
    num_to_trim <- ceiling(x / 100 * n)
    sorted_subset <- subset_df[order(subset_df[[numeric_col]]), ]
    trimmed_df <- sorted_subset[(num_to_trim + 1):(n - num_to_trim), ]
    trimmed_dataframes[[category]] <- trimmed_df
  }
  trimmed_combined <- do.call(rbind, trimmed_dataframes)
  return(trimmed_combined <- trimmed_combined %>%
    arrange(category_col, date_column))
}
My Question: I hope my code does what it is supposed to do, but I was wondering whether there is an existing method in R that does the same thing.
Bonus Question: I don't understand why my final data is not sorted by the date column.
Order by cat and date rather than by cat and val. (This should also work with dplyr::arrange, but I don't want to load dplyr.)

You can use ave, where the first argument is the value val and the second is the category cat. ave applies FUN to the values in each category. To find the highest and lowest 2% we can use quantile and then compare the values against the result. The comparison is actually boolean, but because val is numeric we get a numeric vector back, so we use as.logical to get the desired boolean vector, with which we can generate ss to subset the data frame.

Data:
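For reference, the block below rebuilds the question's data in base R and then gives a minimal sketch of the ave/quantile approach described above. It is not the answer's original code: the name p (the fraction trimmed from each tail), the strict comparisons, and the name trimmed are assumptions made for illustration, while ss follows the description above. Note that a quantile cut is not guaranteed to drop the same rows as the ceiling-based row count in the question's function.

```r
# Rebuild the question's data (base R only, no dplyr).
set.seed(3994)
val  <- round(runif(n = 30, min = 5, max = 300), digits = 0)
cat  <- rep(c("A", "B", "C"), each = 10)
date <- as.Date(sample(seq(as.Date("2000/01/01"), as.Date("2020/01/01"), by = "day"), 30))
df   <- data.frame(val, cat, date)

p <- 0.02  # assumed: fraction to trim from each tail (2%)

# ave() applies FUN to val within each level of cat. Because val is numeric,
# the logical result comes back as 0/1, so as.logical() restores TRUE/FALSE.
ss <- as.logical(ave(df$val, df$cat, FUN = function(v) {
  q <- quantile(v, probs = c(p, 1 - p))
  v > q[1] & v < q[2]  # keep values strictly inside the two quantiles
}))

trimmed <- df[ss, ]

# Sort by category, then by date.
trimmed <- trimmed[order(trimmed$cat, trimmed$date), ]
```

If dplyr is used for the final sort instead, note that arrange(category_col, date_column) in the question's function sorts by the constant strings stored in those two variables, which leaves the row order unchanged; arrange(.data[[category_col]], .data[[date_column]]) refers to the columns themselves.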