summarize() deprecated in R. What to use instead?


I got this message from R:

Returning more (or less) than 1 row per summarise() group was deprecated in dplyr 1.1.0.
ℹ Please use reframe() instead.
ℹ When switching from summarise() to reframe(), remember that reframe() always returns an ungrouped data frame and adjust accordingly.
Call lifecycle::last_lifecycle_warnings() to see where this warning was generated.

Now, reframe() keeps all the rows, so do I just need to use distinct() all the time after using reframe() in order to get the same result as the deprecated summarise()? I use summarise mostly after a group_by(). Do you know a more concise workaround?

Thank you for your answers!

Example Code:

business_unit <- c("BU1", "BU1", "BU1", "BU2", "BU2", "BU2", "BU2", "BU3", "BU3")
year <- c(2020, 2020, 2020, 2020, 2020, 2022, 2020, 2021, 2022)
sickness_cases <- c(10, 10, 10, 8, 8, 18, 5, 9, 14)
user_name <- c("John", "John", "Alice", "Alice", "Alice", "Alice", "Paul", "Bob", "Bob")
example_data <- data.frame(business_unit,year, sickness_cases, user_name)

Each row represents a login into an app and the user_name is the person that logged in. I want to calculate the number of logins divided by the number of sickness cases of this business unit. And then show bars per business unit in a ggplot.

In the code below I want to aggregate the data, and this is where the warning occurs (I need to use reframe() instead of summarize()). How can I avoid using reframe() followed by distinct() to get only one row for each business unit and year? If I don't use distinct(), I get multiple duplicated rows. With summarize() I could have done this directly (without using distinct()).

grouped_data <- example_data %>% select(business_unit, year, sickness_cases) %>%
  group_by(business_unit, year) %>%
  reframe(logins = n(), # here I would like to use summarize, so I don't need to use distinct() afterwards.
         logins_div_sickness_case = logins/sickness_cases
         ) %>% distinct()

If I use this code I get the warning:

grouped_data <- example_data %>% select(business_unit, year, sickness_cases) %>%
  group_by(business_unit, year) %>%
  summarize(logins = n(), 
         logins_div_sickness_case = logins/sickness_cases) 

The final output I want, is this plot:

ggplot(grouped_data, aes(x = business_unit, y = logins_div_sickness_case))+
  geom_bar(position="dodge", stat="identity") +
  facet_grid(~year)

Murad Khalilov

It would be better if you provided example code/data.

Okay, let's talk about the differences between reframe() and summarise(). reframe() creates a new data frame by applying functions to columns of an existing data frame. It is most similar to summarise(), with two big differences: reframe() can return an arbitrary number of rows per group, while summarise() reduces each group down to a single row; and reframe() always returns an ungrouped data frame.

So reframe() returns as many rows as its expressions produce, much like mutate() on the original data, whereas summarise() tries to produce exactly one row per group. In your case the groups probably contain repeated values, and the aggregation expression you are using does not collapse them to a single value. That is why summarise() cannot reduce each group to one row.
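To make the difference concrete, here is a minimal sketch on a made-up data frame. The names df, g, x, one_row and many_rows are hypothetical and not from the question, and the sketch assumes dplyr 1.1.0 or later, where reframe() is available.

```r
library(dplyr)

# Toy data (hypothetical): group "a" has two rows, group "b" has one
df <- data.frame(g = c("a", "a", "b"), x = c(1, 2, 3))

# summarise(): every expression returns a single value per group,
# so the result has one row per group
one_row <- df %>%
  group_by(g) %>%
  summarise(n = n(), total = sum(x))

# reframe(): x / sum(x) returns one value per input row, and the
# length-1 n() is recycled to match, so all input rows come back
many_rows <- df %>%
  group_by(g) %>%
  reframe(n = n(), x_share = x / sum(x))

nrow(one_row)
nrow(many_rows)
```

Here one_row has one row per group, while many_rows has as many rows as df and is no longer grouped.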

What you can do is wrap the expression that returns two or more values in unique(), for example:


# mydf is a placeholder data frame with the columns year and result
mydf %>% 
  group_by(year) %>% 
  mutate(count_by_year = n()) %>% # total for each year
  group_by(year, result) %>%  
  summarise(count_year_res = n(), # count of positives and negatives in each year
            perc = unique(count_year_res/count_by_year*100)) # unique() collapses the repeated value to one

unique() is used here so that the perc column yields a single row per group. This only works when all values within a group are identical.
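For a version of this pattern that runs as-is, here is a sketch with a small invented mydf. The year and result values are made up for illustration, since the original data is not shown, and out is a hypothetical name. The `.groups = "drop"` argument is an addition that returns an ungrouped result.

```r
library(dplyr)

# Hypothetical data: one row per test result
mydf <- data.frame(
  year   = c(2020, 2020, 2020, 2021, 2021),
  result = c("pos", "neg", "pos", "pos", "neg")
)

out <- mydf %>%
  group_by(year) %>%
  mutate(count_by_year = n()) %>%      # total rows for each year
  group_by(year, result) %>%
  summarise(
    count_year_res = n(),              # rows per year and result
    # count_by_year is constant within a year, so the ratio repeats
    # the same value and unique() reduces it to length 1
    perc = unique(count_year_res / count_by_year * 100),
    .groups = "drop"
  )

out
```

Within each year the perc values add up to 100, because every row falls into exactly one result category.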

EDITED

I still don't understand why you want to use reframe(). As for the warning, it is a deprecation notice rather than an error, so you can ignore it: the code still runs and the resulting data structure is more or less correct. In general, though, whether you use reframe() or summarise(), the input used to calculate logins_div_sickness_case has to be a single value per group. When you group by business unit and year, BU1/2020 has three identical rows, which n() counts as 3, but sickness_cases is also present in three rows (10, 10, 10), and dplyr does not know which one should be used. That is why you get the warning. If you want to use summarise(), that vector has to be reduced to a single value.

example_data %>% select(business_unit, year, sickness_cases) %>%
  group_by(business_unit, year, sickness_cases) %>%
  summarize(logins = n(), # number of rows (logins) in each group
            logins_div_sickness_case = logins/unique(sickness_cases) # one value per group, since sickness_cases is a grouping column
  ) 

The code above adds sickness_cases to group_by() and divides by unique(sickness_cases), so you will not get any warning. Again, the warning is only information about your code, not an error or a problem with your approach.
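Two other ways to get one row per group without calling distinct() can be sketched as follows. Neither is taken from the answer itself; they are alternatives, and the names by_count and by_first are made up. One caveat that matters for the choice: in the example data, BU2/2020 has two different sickness_cases values (8 and 5). Grouping or counting by sickness_cases therefore splits that group in two and counts logins separately for each value, while picking the first value keeps one row per business unit and year.

```r
library(dplyr)

# Example data from the question
example_data <- data.frame(
  business_unit  = c("BU1", "BU1", "BU1", "BU2", "BU2", "BU2", "BU2", "BU3", "BU3"),
  year           = c(2020, 2020, 2020, 2020, 2020, 2022, 2020, 2021, 2022),
  sickness_cases = c(10, 10, 10, 8, 8, 18, 5, 9, 14),
  user_name      = c("John", "John", "Alice", "Alice", "Alice", "Alice", "Paul", "Bob", "Bob")
)

# Variant 1: count() groups, tallies and ungroups in one step;
# sickness_cases is part of the key, so BU2/2020 yields two rows
by_count <- example_data %>%
  count(business_unit, year, sickness_cases, name = "logins") %>%
  mutate(logins_div_sickness_case = logins / sickness_cases)

# Variant 2: keep one row per business unit and year and state
# explicitly that the first sickness_cases value is the one to use
by_first <- example_data %>%
  group_by(business_unit, year) %>%
  summarise(
    logins = n(),
    logins_div_sickness_case = logins / first(sickness_cases),
    .groups = "drop"
  )
```

Either result can be passed to the ggplot() call from the question. Which variant is right depends on whether sickness_cases is meant to be constant within a business unit and year.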