Looping a function through a list of dataframes is very slow


I have a list containing 4438 dataframes of different sizes. I am not sure how to make a reproducible example, but I obtained the list by applying the expand.grid function to each element, so that each dataframe holds all possible combinations of that element's values:

citation <- citation %>% 
    map_depth(., 1, expand.grid) 

[Image: the list before applying expand.grid]

[Image: the list after applying expand.grid]
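
To at least show the shape of the data, here is a toy list with the same structure (citation_toy is a made-up stand-in, not my real data):

library(purrr)

citation_toy <- list(
  list(a = c(1, 2), b = c(2, 3)),
  list(a = c(1, 4), b = c(4, 5), c = c(1, 5))
)
citation_toy <- citation_toy %>% 
    map_depth(., 1, expand.grid)
# each element is now a dataframe with one row per combination:
# 2 * 2 = 4 rows for the first element, 2 * 2 * 2 = 8 for the second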

What I want to achieve is, for each dataframe, to count the number of unique values in each row and then find the minimum of those counts over the whole dataframe.

First, I wrote the function below:

fun1 <- function(res){ 
    min(apply(res,1,function(x) length(unique(x)))) 
}
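
For example, on a small made-up dataframe:

res <- data.frame(a = c(1, 2), b = c(1, 3), c = c(1, 3))
fun1(res)
# row 1 is (1, 1, 1) -> 1 unique value; row 2 is (2, 3, 3) -> 2
#> [1] 1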

Then I apply the function to each dataframe in parallel:

library(furrr)
plan(multisession, workers = 4)
min_set <- citation %>% future_map_dbl(fun1)
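
For reference, the plain sequential equivalent (useful as a timing baseline) would be:

library(purrr)
min_set_seq <- map_dbl(citation, fun1)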

However, the calculation is very slow, taking almost 30 minutes to complete. Is there a faster way to do this? Thank you in advance.


Answer by jblood94:

To speed up the current approach of enumerating the combinations, use rowTabulate from the Rfast package (or rowTabulates from the matrixStats package).
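
For instance, a matrixStats sketch of the same row-wise count (fun2b is just an illustrative name; it assumes the columns are integer codes, as in citation2 below):

library(matrixStats)
# count how many distinct values appear in each row, then take the minimum
fun2b <- function(res) min(rowSums(rowTabulates(as.matrix(res)) > 0L))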

However, it will be much faster to get the desired result with the setcover function from the adagio package, which solves the set cover problem directly (i.e., without expand.grid) via integer linear programming with lp from the lpSolve package. The two problems are equivalent: each row of the expand.grid output picks one value from each original vector, so the minimum number of unique values over all rows equals the minimum number of values needed to cover every vector.

library(Rfast) # for the rowTabulate function
library(adagio) # for the setcover function

# reproducible example data
set.seed(1141593349)
citation1 <- list(
  lapply(c(5,2,8,12,6,38), function(size) sample(50, size)),
  lapply(c(5,2,8,12,7), function(size) sample(50, size))
)
# get all combinations of the indices of the unique values for each list in citation1
citation2 <- lapply(citation1, function(x) expand.grid(lapply(x, match, table = unique(unlist(x)))))

# original solution
fun1 <- function(res) min(apply(res, 1, function(x) length(unique(x))))
# faster version of the original solution
fun2 <- function(res) min(rowsums(rowTabulate(as.matrix(res)) > 0L))
# linear programming solution (uses citation1 rather than citation2)
fun3 <- function(res) {
  v <- unlist(res)
  # binary incidence matrix: m[value, list] = 1 if the value occurs in that list
  m <- matrix(0L, max(v), length(res))
  m[cbind(v, rep.int(seq_along(res), lengths(res)))] <- 1L
  # minimum number of values (rows) needed to cover every list (column)
  setcover(m)$objective
}
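
As a tiny sanity check of the set cover mapping (made-up data):

res <- list(c(1L, 3L), c(2L, 3L), c(3L, 4L))
fun3(res)
# value 3 occurs in all three lists, so a single value covers everything
#> [1] 1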

microbenchmark::microbenchmark(fun1 = sapply(citation2, fun1),
                               fun2 = as.integer(sapply(citation2, fun2)),
                               fun3 = as.integer(sapply(citation1, fun3)),
                               times = 10,
                               check = "identical")
#> Unit: milliseconds
#>  expr       min          lq        mean      median          uq         max
#>  fun1 1110.4976 1162.003601 1217.049501 1204.608151 1281.121601 1331.057001
#>  fun2  101.5173  113.123501  142.265371  145.964502  165.788700  187.196301
#>  fun3    1.4038    1.461101    1.734781    1.850701    1.870801    1.888702