Looping a function through a list of dataframes is very slow


I have a list containing 4438 dataframes of different sizes. I am not sure how to make a reproducible example, but I obtained the list by applying the expand.grid function to each element, so that each dataframe holds all possible combinations of that element's values:

citation <- citation %>% 
    map_depth(., 1, expand.grid) 

[Image: the list before applying expand.grid]

[Image: the list after applying expand.grid]
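
To at least show the shape of the data, here is a toy list with the same structure (citation_toy is a made-up stand-in, not my real data):

library(purrr)

citation_toy <- list(
  list(a = c(1, 2), b = c(2, 3)),
  list(a = c(1, 4), b = c(4, 5), c = c(1, 5))
)
citation_toy <- citation_toy %>% 
    map_depth(., 1, expand.grid)
# each element is now a dataframe with one row per combination:
# 2 * 2 = 4 rows for the first element, 2 * 2 * 2 = 8 for the second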

What I want to achieve is, for each dataframe, to count the number of unique values in each row and then find the minimum of those counts over the whole dataframe.

First, I wrote the function below:

fun1 <- function(res){ 
    min(apply(res,1,function(x) length(unique(x)))) 
}
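
For example, on a small made-up dataframe:

res <- data.frame(a = c(1, 2), b = c(1, 3), c = c(1, 3))
fun1(res)
# row 1 is (1, 1, 1) -> 1 unique value; row 2 is (2, 3, 3) -> 2
#> [1] 1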

Then I apply the function to each dataframe in parallel:

library(furrr)
plan(multisession, workers = 4)
min_set <- citation %>% future_map_dbl(fun1)
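
For reference, the plain sequential equivalent (useful as a timing baseline) would be:

library(purrr)
min_set_seq <- map_dbl(citation, fun1)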

However, the calculation is very slow, taking almost 30 minutes to complete. Is there a faster way to do this? Thank you in advance.


Answer by jblood94:

To speed up the current approach of enumerating the combinations, use rowTabulate from the Rfast package (or rowTabulates from the matrixStats package).
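
For instance, a matrixStats sketch of the same row-wise count (fun2b is just an illustrative name; it assumes the columns are integer codes, as in citation2 below):

library(matrixStats)
# count how many distinct values appear in each row, then take the minimum
fun2b <- function(res) min(rowSums(rowTabulates(as.matrix(res)) > 0L))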

However, it will be much faster to get the desired result with the setcover function from the adagio package, which solves the set cover problem directly (i.e., without expand.grid) via integer linear programming with lp from the lpSolve package. The two problems are equivalent: each row of the expand.grid output picks one value from each original vector, so the minimum number of unique values over all rows equals the minimum number of values needed to cover every vector.

library(Rfast) # for the rowTabulate function
library(adagio) # for the setcover function

# reproducible example data
set.seed(1141593349)
citation1 <- list(
  lapply(c(5,2,8,12,6,38), function(size) sample(50, size)),
  lapply(c(5,2,8,12,7), function(size) sample(50, size))
)
# get all combinations of the indices of the unique values for each list in citation1
citation2 <- lapply(citation1, function(x) expand.grid(lapply(x, match, table = unique(unlist(x)))))

# original solution
fun1 <- function(res) min(apply(res, 1, function(x) length(unique(x))))
# faster version of the original solution
fun2 <- function(res) min(rowsums(rowTabulate(as.matrix(res)) > 0L))
# linear programming solution (uses citation1 rather than citation2)
fun3 <- function(res) {
  v <- unlist(res)
  # binary incidence matrix: m[value, list] = 1 if the value occurs in that list
  m <- matrix(0L, max(v), length(res))
  m[cbind(v, rep.int(seq_along(res), lengths(res)))] <- 1L
  # minimum number of values (rows) needed to cover every list (column)
  setcover(m)$objective
}
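
As a tiny sanity check of the set cover mapping (made-up data):

res <- list(c(1L, 3L), c(2L, 3L), c(3L, 4L))
fun3(res)
# value 3 occurs in all three lists, so a single value covers everything
#> [1] 1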

microbenchmark::microbenchmark(fun1 = sapply(citation2, fun1),
                               fun2 = as.integer(sapply(citation2, fun2)),
                               fun3 = as.integer(sapply(citation1, fun3)),
                               times = 10,
                               check = "identical")
#> Unit: milliseconds
#>  expr       min          lq        mean      median          uq         max
#>  fun1 1110.4976 1162.003601 1217.049501 1204.608151 1281.121601 1331.057001
#>  fun2  101.5173  113.123501  142.265371  145.964502  165.788700  187.196301
#>  fun3    1.4038    1.461101    1.734781    1.850701    1.870801    1.888702