All combinations of 1 to n_1, 1 to n_2, ..., 1 to n_k in R as vectors in a list?

I am looking for the most efficient way to create combinations based on n different series in R.

Base R has a nice function, expand.grid, which returns all the combinations as a data frame, but I need each combination as a vector in its own list element.

Getting there from the data frame is not difficult, but building a data frame first seems like an unnecessary step.

For example, suppose I want all the combinations where the first element is 1, 2, 3 or 4, the second is 1 or 2, and the third is 1, 2 or 3. The input should be the highest integer of each series:

library(dplyr)
c(4,2,3) %>% #This is the input; the length can be anything, here it happens to be 3
    lapply(\(elem) seq(elem)) %>% #Here we create the sequences: 1,2,3,4 & 1,2 & 1,2,3
    expand.grid %>% #A data frame with all the possible combinations
    {split(unlist(.),seq(nrow(.)))} #We unlist the whole data frame into one vector, and split it into 1,2,...,nrow(data frame) equally sized vectors, which ultimately become list elements

So, is there a more efficient way to achieve this?

Onyambu On BEST ANSWER

(I just added this in order to remove the down vote, Aku-Ville Lehtimäki)

If looking for efficiency, consider writing the code in C++:

Rcpp::cppFunction("
std::vector<std::vector<int>> product(std::vector<int> x){
  std::vector<std::vector<int>> result = {{}};
  for (auto i: x){
      std::vector<std::vector<int>> sub_result;
      for (auto j: result) for (auto k = 1; k<=i;k++) {
          std::vector<int> row(j);
          row.push_back(k);
          sub_result.push_back(row);
      }
      result = sub_result;
  }
  return result;
}
")

product(1:3)
[[1]]
[1] 1 1 1

[[2]]
[1] 1 1 2

[[3]]
[1] 1 1 3

[[4]]
[1] 1 2 1

[[5]]
[1] 1 2 2

[[6]]
[1] 1 2 3

Edit

I believe it's possible to use only indexing instead of copying data, which should improve the speed significantly. Here is a way to get halfway there:

Rcpp::cppFunction("
std::vector<std::vector<int>> product_2(std::vector<int> x){
  std::vector<std::vector<int>> result = {{}};
  for (auto i: x){
      std::vector<std::vector<int>> sub_result;
      for (auto j: result) {
        int n = j.size();
        j.resize(n+1);
        for (auto k = 1; k<=i;k++) {
          j[n] = k;
          sub_result.push_back(j);
        }
      }
      result = sub_result;
  }
  return result;
}
")

microbenchmark::microbenchmark(product(n), product_2(n), check = 'equal')
Unit: milliseconds
         expr    min      lq     mean  median      uq     max neval
   product(n) 5.3796 5.53625 6.997803 6.17555 7.87635 14.4701   100
 product_2(n) 3.3674 3.55285 4.583415 4.14890 5.50670  8.4493   100

Note that this is not the best possible solution, but it's the best I could think of so far.
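For illustration only, here is one way to read the "indexing only" idea, sketched in plain R rather than C++: each combination is computed directly from its position k in the output, so nothing is copied from earlier rows. The name product_by_index and the strides helper are made up for this sketch, and it has not been benchmarked; being an R-level loop, it is probably slower than either C++ version.

```r
product_by_index <- function(n) {
  n <- as.integer(n)
  # strides[i] = how many consecutive combinations share the same value
  # at position i (the last position varies fastest, as in product())
  strides <- as.integer(rev(cumprod(c(1L, rev(n)[-length(n)]))))
  # k runs over 0 .. prod(n) - 1; its mixed-radix digits are the combination
  lapply(seq_len(prod(n)) - 1L, function(k) (k %/% strides) %% n + 1L)
}

product_by_index(1:3)[[4]]
# [1] 1 2 1
```

The fourth element matches the fourth element of product(1:3) shown above.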

ThomasIsCoding On

What about this? It unlists the expand.grid result once and splits the values by row number:

split(
    unlist(expand.grid(lapply(n, seq_len)), use.names = FALSE),
    rep(seq_len(prod(n)), length(n))
)
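To see why this split yields one vector per combination, here is a small walk-through on a two-series input. It is only an illustration; the intermediate names (grid, flat, row_id, combos) are introduced here and are not part of the answer.

```r
n <- c(2, 3)
grid <- expand.grid(lapply(n, seq_len))     # 6 rows, 2 columns
flat <- unlist(grid, use.names = FALSE)     # columns concatenated: Var1, then Var2
row_id <- rep(seq_len(prod(n)), length(n))  # row number of every entry in flat
combos <- split(flat, row_id)               # entries with the same row number go together

combos[["3"]]  # third row of grid
# [1] 1 2
```

Because unlist walks the data frame column by column, repeating the row numbers once per column labels every value with the row it came from.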

Benchmarking

library(dplyr)           # provides %>% for f0
library(microbenchmark)

n <- c(4, 2, 3, 6, 4, 5, 8)

f0 <- function(n) {
    n %>% # The input; it can have any length
        lapply(\(elem) seq(elem)) %>% # One sequence 1..elem per entry of n
        expand.grid() %>% # A data frame with all the possible combinations
        {
            split(unlist(.), seq(nrow(.)))
        } # Unlist the data frame into one vector and split it by row number, so each row becomes a list element
}
f1 <- function(n) {
    split(
        unlist(expand.grid(lapply(n, seq_len)), use.names = FALSE),
        rep(seq_len(prod(n)), length(n))
    )
}

f2 <- function(n) {
    asplit(unname(expand.grid(lapply(n, seq_len))), 1)
}

microbenchmark(
    f0 = f0(n),
    f1 = f1(n),
    f2 = f2(n),
    unit = "relative"
)

which gives

Unit: relative
 expr      min       lq     mean   median       uq      max neval
   f0 2.891338 2.868985 2.680626 3.072950 3.174406 1.281119   100
   f1 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000   100
   f2 3.530446 3.628943 3.620389 4.094486 4.448569 1.323821   100

Aku-Ville Lehtimäki On

I tried to write my own function in C++, but it did not perform well in the benchmark.

library(Rcpp)
cppFunction('
#include <Rcpp.h>
using namespace Rcpp;

List generate_combinations(IntegerVector input) {
  int n = input.size();
  std::vector<int> indices(n, 0);
  List result;
  
  while (true) {
    IntegerVector combination(n);
    for (int i = 0; i < n; ++i) {
      combination[i] = indices[i] + 1;
    }
    result.push_back(combination);
    int i = 0;
    while (i < n && ++indices[i] == input[i]) {
      indices[i++] = 0;
    }
    if (i == n) break;
  }
  
  return result;
}

// [[Rcpp::export]]
List generate_combinations_cpp(IntegerVector input) {
  return generate_combinations(input);
}
')

And here is the updated code for comparisons:

n <- c(4, 2, 3, 6, 4, 5, 8)

f0 <- function(n) {
  n %>% # The input; it can have any length
    lapply(\(elem) seq(elem)) %>% # One sequence 1..elem per entry of n
    expand.grid() %>% # A data frame with all the possible combinations
    {
      split(unlist(.), seq(nrow(.)))
    } # Unlist the data frame into one vector and split it by row number, so each row becomes a list element
}
f1 <- function(n) {
  split(
    unlist(expand.grid(lapply(n, seq_len)), use.names = FALSE),
    rep(seq_len(prod(n)), length(n))
  )
}

f2 <- function(n) {
  asplit(unname(expand.grid(lapply(n, seq_len))), 1)
}


microbenchmark(
  f0 = f0(n),
  f1 = f1(n),
  f2 = f2(n),
  f3 = product(rev(n)),
  f4 = generate_combinations_cpp(n),
  unit = "relative"
)

And here are the results:

Unit: relative
 expr        min         lq       mean     median         uq        max neval
   f0   6.075446   6.201464   5.949658   6.038376   5.923006   3.852680   100
   f1   1.798032   1.775192   1.853756   1.727536   1.734581   3.881149   100
   f2   5.647874   5.462727   5.170237   5.189570   4.989141   4.967802   100
   f3   1.000000   1.000000   1.000000   1.000000   1.000000   1.000000   100
   f4 226.973744 224.315454 213.317704 216.482964 208.224358 123.438426   100

Onyambu's code is clearly the winner. My C++ code is crap, so don't use generate_combinations_cpp. :D
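One likely reason generate_combinations_cpp is so slow is the result.push_back(combination) call: as far as I know, push_back on an Rcpp List allocates a new list and copies the old one on every call, so the loop does quadratic work. Below is a minimal sketch of the same odometer loop with the list allocated once up front. The name generate_combinations_prealloc is made up here, it needs Rcpp and a working C++ compiler, and I have not benchmarked it against product().

```r
Rcpp::cppFunction("
List generate_combinations_prealloc(IntegerVector input) {
  int n = input.size();
  // Total number of combinations, so the result list is sized exactly once
  // (assumes the product fits in an int)
  int total = 1;
  for (int i = 0; i < n; ++i) total *= input[i];
  List result(total);
  std::vector<int> indices(n, 0);
  for (int r = 0; r < total; ++r) {
    IntegerVector combination(n);
    for (int i = 0; i < n; ++i) combination[i] = indices[i] + 1;
    result[r] = combination;
    // Odometer-style increment: the first position varies fastest
    int i = 0;
    while (i < n && ++indices[i] == input[i]) indices[i++] = 0;
  }
  return result;
}
")

generate_combinations_prealloc(c(2, 3))[[3]]
# [1] 1 2
```

The combinations come out in the same order as expand.grid rows, with the first position changing fastest.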