More Efficient Summarise / Summarize in R


I have code that makes a large number of summarise() calls, and it takes ages to run.

For example:

library(dplyr)

df <- data.frame(Letter = letters, Num = c(1 : (26*10) ))

for (x in 1:10000) {
  df_sum_Tot <- summarise(df, Sum_Num = sum(Num))
  df_sum_Letter <- summarise(df, Sum_Num = sum(Num), .by = Letter)
}

Is there a more efficient alternative to summarise I could use to speed it up?

rw2 On

If you're working with thousands of different datasets, you could put them all into a list and use lapply to summarise them all, rather than using a for loop.
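As a minimal sketch of the list-plus-lapply idea, still using dplyr: the `datasets` list and the helper `summarise_one` are hypothetical names made up for illustration, and the list simply holds three copies of the data frame from the question. This mostly keeps the results organised in one place; how much time it saves on its own is worth measuring on your real data.

```r
library(dplyr)

# Hypothetical input: a named list holding copies of the question's data frame.
df <- data.frame(Letter = letters, Num = 1:(26 * 10))
datasets <- list(a = df, b = df, c = df)

# Hypothetical helper: overall total and per-letter totals for one data frame.
summarise_one <- function(d) {
  list(
    Total    = summarise(d, Sum_Num = sum(Num)),
    ByLetter = summarise(d, Sum_Num = sum(Num), .by = Letter)
  )
}

# One result per dataset, kept under the same names as the input list.
results <- lapply(datasets, summarise_one)
```

Each element can then be reached by name, for example `results$a$ByLetter`.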

Other packages can also be much more efficient for summarising than dplyr, especially with large datasets. For example, data.table or collapse:

# Assuming datasets is your list of all your data.frames:
# Using data.table
library(data.table)
results <- lapply(datasets, function(df) {
  setDT(df)  # convert to a data.table in place, avoiding a copy
  df_sum_Tot <- df[, .(Sum_Num = sum(Num))]
  df_sum_Letter <- df[, .(Sum_Num = sum(Num)), by = Letter]
  list(Total = df_sum_Tot, ByLetter = df_sum_Letter)
})

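# Using collapse:
library(collapse)
results <- lapply(datasets, function(df) {
  # fsummarise() works directly on a data.frame, so no conversion is needed
  df_sum_Tot <- fsummarise(df, Sum_Num = fsum(Num))
  # fgroup_by() sets the grouping that fsummarise() then aggregates over
  df_sum_Letter <- fsummarise(fgroup_by(df, Letter), Sum_Num = fsum(Num))
  list(Total = df_sum_Tot, ByLetter = df_sum_Letter)
})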
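
If the datasets are small and all share the same columns, much of the time may go into per-call overhead rather than the sums themselves. One option worth trying, sketched here under that assumption, is to stack everything into a single data.table and run one grouped aggregation. The `datasets` list and the `dataset` id column are hypothetical names chosen for illustration.

```r
library(data.table)

# Hypothetical input: a named list holding copies of the question's data frame.
df <- data.frame(Letter = letters, Num = 1:(26 * 10))
datasets <- list(a = df, b = df, c = df)

# Stack all data frames; idcol records which list element each row came from.
combined <- rbindlist(datasets, idcol = "dataset")

# One grouped call per summary instead of one call per dataset.
totals    <- combined[, .(Sum_Num = sum(Num)), by = dataset]
by_letter <- combined[, .(Sum_Num = sum(Num)), by = .(dataset, Letter)]
```

Here `totals` has one row per dataset and `by_letter` has one row per dataset and letter, so the per-dataset results can be recovered by filtering on `dataset`.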