Count number of outliers by group in r and store count in new dataframe

440 Views Asked by At

I have a dataset that has 2 columns; column A is State_Name and has 5 different options of state, and column B is Total_Spend which has the average total spend of that state per day. There are 365 observations for each state.

What I want to do is count the number of outliers PER STATE using the 1.5 IQR rule and save the count of outliers per state to a new df or table.

So I would expect an output something like:

State Outlier Count
ATL 5
GA 20
MI 11
NY 50
TX 23

I have managed to get it to work by doing it one state at a time but I can't figure out what to do to achieve this in a single go.

Here is my code at the moment (to return the result for a single state):

  daily_agg %>% 
  select(State_Name, Total_Spend) %>%
  filter(State_Name == "NY")

outlier_NY <- length(boxplot.stats(outlier_df$Total_Spend)$out)

Any help would be appreciated.

Thanks!

EDIT WITH TEST DATASET


outlier_mtcars <- 
  df %>%  
  select(cyl, disp) %>%
  filter(cyl == "6")
  
outliers <- length(boxplot.stats(outlier_mtcars$disp)$out)

The above shows me 1 outlier for 6 cyl cars but I want a table that shows how many outliers for 4, 6, 8 cyl cars

2

There are 2 best solutions below

3
benson23 On BEST ANSWER

Since I'm not very familiar with the function boxplot.stats, I didn't use this in my solution and instead manually calculates 1.5 * IQR + upper quantile.

Here mtcars was used as an example. For the records that are outliers, they are "flagged" as TRUE, where we can sum them up in summarize.

library(dplyr)

mtcars %>% 
  group_by(cyl) %>% 
  mutate(flag = disp >= (IQR(disp) * 1.5 + quantile(disp, probs = 0.75)), .keep = "used") %>% 
  summarize(Outlier = sum(flag))

# A tibble: 3 × 2
    cyl Outlier
  <dbl>   <int>
1     4       0
2     6       1
3     8       0
1
Justin Beresford On

Since I don't have your data, I'll make some up with the two columns you mention:

df<-data.frame(state=sample(c("ny","fl"),100, replace=TRUE),
           spend=sample(1:100, 100, replace=TRUE))

> head(df)
  state spend
1    ny     3
2    fl    87
3    ny    91
4    fl    97
5    ny    47
6    fl     8

Then set your upper and lower bounds (could be quartiles, absolutes, whatever..)

df%>%
  group_by(state)%>%
  mutate(lower_bound=quantile(spend,0.25),
         upper_bound=quantile(spend,0.75))%>%
  mutate(is_outlier=if_else(spend<lower_bound|spend>upper_bound,TRUE,FALSE))

# A tibble: 10 × 5
# Groups:   state [2]
   state spend lower_bound upper_bound is_outlier
   <chr> <int>       <dbl>       <dbl> <lgl>     
 1 ny        3          38          84 TRUE      
 2 fl       87          26          87 FALSE     
 3 ny       91          38          84 TRUE      
 4 fl       97          26          87 TRUE 

Then if you only want to see the output, summarise by is_outlier:

df%>%
  group_by(state)%>%
  mutate(lower_bound=quantile(spend,0.25),upper_bound=quantile(spend,0.75))%>%
  mutate(is_outlier=if_else(spend<lower_bound|spend>upper_bound,TRUE,FALSE))%>%
  summarise(outliers=sum(is_outlier))

  state         outliers
  <chr>             <int>
1 fl                   19
2 ny                   30