dplyr to data.table for speed up execution time

579 Views Asked by user177196 At 04 May 2022 at 22:30

I am currently dealing with a moderately large dataframe called d.mkt (> 2M rows and 12 columns). As dplyr is too slow when applying summarise() function combined with group_by_at, I am trying to write an equivalent statement using data.table to speed up the summarise computation part of dplyr. However, the situation is quite special in the case that the original dataframe is group_by_at and then summarising over the same set of columns (e.g. X %>% select(-id) %>% group_by_at(vars(-x,-y,-z,-t) %>% summarise(x = sum(x), y = sum(y), z = sum(z), y = sum(t)) %>% ungroup()).

With that in mind, below is my current attempt, which kept failing to work because of this error: keyby or by has length (1,1,1,1). Could someone please help let me know how to fix this error?

dplyr's code

    d.mkt <- d.mkt %>%
  left_join(codes, by = c('rte_cd', 'cd')) %>%
  mutate(is_valid = replace_na(is_valid, FALSE),
         rte_cd = ifelse(is_valid, rte_cd, 'RC'),
         rte_dsc = ifelse(is_valid, rte_dsc, 'SKIPPED')) %>%
  select(-is_valid) %>%
  group_by_at(vars(-c_rv, -g_rv, -h_rv, -rn)) %>%
  summarise(c_rv = sum(as.numeric(c_rv)), g_rv = sum(as.numeric(g_rv)), h_rv = sum(as.numeric(h_rv)), rn = sum(as.numeric(rn))) %>%
  ungroup()

My attempt for translating the above

  d.mkt <- as.data.table(d.mkt)
  d.mkt <- d.mkt[codes, on = c('rte_cd', 'sb_cd'),
                        `:=` (is.valid = replace_na(is_valid, FALSE), rte_cd = ifelse(is_valid, rte_cd, 'RC00'), 
                              rte_ds = ifelse(is_valid, rte_ds, 'SKIPPED'))]
  d.mkt <- d.mkt[, -"is.valid", with=FALSE]
  d.mkt <- d.mkt[, .(c_rv=sum(c_rv), g_rv=sum(g_rv), h_rv = sum(h_rv), rn = sum(rn)), by = .('prop', 'date')] --- Error here already, but how do we ungroup a `data.table` though?

Original Q&A

There are 1 best solutions below

r2evans On 04 May 2022 at 22:53

Close. Some suggestions/answers.

If you're shifting to data.table for speed, I suggest use if fifelse in lieu of replace_na and ifelse, minor.
The canonical way to remove is_valid is d.mkt[, is.valid := NULL].
Grouping cab be done with a setdiff. In data.table, there is no need to "ungroup", each [-call uses its own grouping. (For the reason, if you have multiple chained [-operations that use the same grouping, it can be useful to store that group as a variable, perhaps index it, and/or combine all the [-chain into a single call. This is prone to lots of benchmarking discussion outside the scope of what we have here.)
Since all of your summary stats are the same, we can lapply(.SD, ..) this for a little readability improvement.

This might work:

library(data.table)
setDT(codes) # or using `as.data.table(codes)` below instead
setDT(d.mkt) # ditto
tmp <- codes[d.mkt, on = .(rte_cd, cd) ] %>%
  .[, c("is_valid", "rte_cd", "rte_dsc") :=
        .(fcoalesce(is_valid, FALSE),
          fifelse(is.na(is_valid), rte_cd, "RC"),
          fifelse(is.an(is_valid), rte_dsc, "SKIPPED")) ]
tmp[, is_valid := NULL ]
cols <- c("c_rv", "g_rv", "h_rv", "rn")
tmp[, lapply(.SD, function(z) sum(as.numeric(z))),
    by = setdiff(names(tmp), cols), .SDcols = cols ]

dplyr to data.table for speed up execution time

There are 1 best solutions below

Related Questions in R

Related Questions in DATAFRAME

Related Questions in DPLYR

Related Questions in DATA.TABLE

Related Questions in DTPLYR

Trending Questions

Popular # Hahtags

Popular Questions