Computationally efficient way to manipulate the levels of large, deeply nested objects?


I have a list of lists of vectors (not a typo; re-confirming that it is in fact a list of lists of vectors) that is 76 million items long. So there is a list of 76 million items, where each item is a list of two vectors.

All the vectors are of uniform length (6 elements).

For example, the data itself looks as follows for list_of_list[1:50]:

dput output

list(list(c(4, 4, 1, 0, 1, 0), c(3, 3, 2, 2, 0, 0)), list(c(4, 
4, 1, 0, 1, 0), c(3, 4, 3, 1, 0, 0)), list(c(4, 4, 1, 0, 1, 0
), c(4, 5, 1, 0, 0, 1)), list(c(4, 4, 1, 0, 1, 0), c(5, 8, 0, 
0, 0, 1)), list(c(4, 4, 1, 0, 1, 0), c(5, 5, 0, 2, 0, 0)), list(
    c(4, 4, 1, 0, 1, 0), c(7, 11, 0, 0, 0, 0)), list(c(4, 4, 
1, 0, 1, 0), c(4, 5, 1, 0, 0, 1)), list(c(4, 4, 1, 0, 1, 0), 
    c(4, 4, 1, 0, 1, 0)), list(c(4, 4, 1, 0, 1, 0), c(6, 10, 
1, 0, 0, 0)), list(c(4, 4, 1, 0, 1, 0), c(3, 4, 3, 1, 0, 0)), 
    list(c(4, 4, 1, 0, 1, 0), c(5, 7, 2, 0, 0, 0)), list(c(4, 
    4, 1, 0, 1, 0), c(40, 10, 0, 15, 8, 0)), list(c(4, 4, 1, 
    0, 1, 0), c(24L, 7L, 6L, 20L, 8L, 1L)), list(c(4, 4, 1, 0, 
    1, 0), c(39L, 22L, 9L, 5L, 8L, 1L)), list(c(4, 4, 1, 0, 1, 
    0), c(34, 36, 17, 15, 0, 2)), list(c(4, 4, 1, 0, 1, 0), c(36L, 
    42L, 18L, 4L, 5L, 1L)), list(c(4, 4, 1, 0, 1, 0), c(4, 5, 
    1, 0, 0, 1)), list(c(4, 4, 1, 0, 1, 0), c(4, 8, 3, 0, 0, 
    0)), list(c(4, 4, 1, 0, 1, 0), c(3, 1, 2, 2, 0, 0)), list(
        c(4, 4, 1, 0, 1, 0), c(6, 9, 0, 1, 0, 0)), list(c(4, 
    4, 1, 0, 1, 0), c(5, 5, 0, 2, 0, 0)), list(c(4, 4, 1, 0, 
    1, 0), c(6, 10, 1, 0, 0, 0)), list(c(4, 4, 1, 0, 1, 0), c(6, 
    10, 1, 0, 0, 0)), list(c(4, 4, 1, 0, 1, 0), c(7, 15, 0, 0, 
    0, 0)), list(c(4, 4, 1, 0, 1, 0), c(7, 11, 0, 0, 0, 0)), 
    list(c(4, 4, 1, 0, 1, 0), c(4, 2, 1, 2, 0, 0)), list(c(4, 
    4, 1, 0, 1, 0), c(28, 24, 19, 14, 4, 0)), list(c(4, 4, 1, 
    0, 1, 0), c(40, 56, 19, 11, 0, 0)), list(c(4, 4, 1, 0, 1, 
    0), c(32L, 33L, 14L, 17L, 1L, 2L)), list(c(4, 4, 1, 0, 1, 
    0), c(24L, 55L, 11L, 16L, 6L, 1L)), list(c(4, 4, 1, 0, 1, 
    0), c(27, 10, 6, 19, 8, 0)), list(c(4, 4, 1, 0, 1, 0), c(31, 
    21, 11, 19, 4, 0)), list(c(4, 4, 1, 0, 1, 0), c(37L, 60L, 
    12L, 7L, 5L, 1L)), list(c(4, 4, 1, 0, 1, 0), c(29L, 8L, 3L, 
    18L, 8L, 1L)), list(c(4, 4, 1, 0, 1, 0), c(21L, 24L, 20L, 
    14L, 5L, 1L)), list(c(4, 4, 1, 0, 1, 0), c(6, 10, 1, 0, 0, 
    0)), list(c(4, 4, 1, 0, 1, 0), c(5, 9, 2, 0, 0, 0)), list(
        c(4, 4, 1, 0, 1, 0), c(7, 13, 0, 0, 0, 0)), list(c(4, 
    4, 1, 0, 1, 0), c(6, 12, 1, 0, 0, 0)), list(c(4, 4, 1, 0, 
    1, 0), c(5, 8, 1, 1, 0, 0)), list(c(4, 4, 1, 0, 1, 0), c(5, 
    7, 0, 2, 0, 0)), list(c(4, 4, 1, 0, 1, 0), c(7, 11, 0, 0, 
    0, 0)), list(c(4, 4, 1, 0, 1, 0), c(5, 6, 1, 1, 0, 0)), list(
        c(4, 4, 1, 0, 1, 0), c(4, 3, 0, 3, 0, 0)), list(c(4, 
    4, 1, 0, 1, 0), c(3, 2, 3, 1, 0, 0)), list(c(4, 4, 1, 0, 
    1, 0), c(4, 4, 1, 2, 0, 0)), list(c(4, 4, 1, 0, 1, 0), c(3, 
    3, 2, 2, 0, 0)), list(c(4, 4, 1, 0, 1, 0), c(5, 7, 0, 2, 
    0, 0)), list(c(4, 4, 1, 0, 1, 0), c(3, 1, 2, 2, 0, 0)), list(
        c(4, 4, 1, 0, 1, 0), c(6, 7, 0, 1, 0, 0)))

Just FYI, the list of lists was made with combn(), using this template: combn(focal_list, 2, simplify = FALSE)
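
For a minimal reproducible sketch of how this structure arises (a toy focal_list of three length-6 vectors; the values are illustrative only):

# toy versions of focal_list / list_of_list, values illustrative
focal_list <- list(
  c(4, 4, 1, 0, 1, 0),
  c(3, 3, 2, 2, 0, 0),
  c(4, 5, 1, 0, 0, 1)
)

# every unordered pair of vectors, kept as a list of lists of vectors
list_of_list <- combn(focal_list, 2, simplify = FALSE)

length(list_of_list)    # 3 pairs from 3 input vectors
str(list_of_list[[1]])  # a list of two length-6 numeric vectors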

Is there a computationally efficient way to turn this into a table of two columns where each row is one item from the list of lists? All the first vectors become the first column and all the second vectors become the second column?

I tried the following, and it just kept going after 10-12 minutes with no output, which is far too expensive for my use case:

library(data.table)

dt <- data.table(col1 = lapply(1:length(list_of_list), function(x) list_of_list[[x]][1]),
                 col2 = lapply(1:length(list_of_list), function(x) list_of_list[[x]][2]))
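
For reference, a variant of the same idea that extracts each vector with [[ instead of [ (a sketch; also untested at the 76-million scale) would be:

# each lapply() call pulls one of the two vectors out of every item;
# using `[[` keeps each cell as a bare length-6 vector rather than a
# one-element list
dt <- data.table(col1 = lapply(list_of_list, `[[`, 1),
                 col2 = lapply(list_of_list, `[[`, 2))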

I could use a foreach loop to untangle the deeply nested object, read the vectors in as character strings separated by a simple delimiter, and then use another foreach loop to build a data.table, but before I do that: is there a simpler way in R that I am missing?

Please note, for clarification, that I want to maintain the vector-like nature of the lowest-level items, i.e. when you make a table out of the list of lists, each cell should be a vector and the data.table should have two columns. R seems to like to flatten vectors and lists when trying to make tables.
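
To make the target shape concrete, here is a tiny hand-built sketch of what the result should look like (dt and the values are illustrative):

# two list columns, each cell holding one length-6 numeric vector
dt <- data.table(
  col1 = list(c(4, 4, 1, 0, 1, 0), c(4, 4, 1, 0, 1, 0)),
  col2 = list(c(3, 3, 2, 2, 0, 0), c(3, 4, 3, 1, 0, 0))
)

dt$col1[[1]]  # still a length-6 numeric vector, not flattened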


3 Answers

ThomasIsCoding (accepted answer)

I think there are several approaches you could take, for example:

  • rbindlist + rapply
rbindlist(rapply(list_of_list, list, how = "replace"))
  • as.data.frame + rbind
as.data.frame(do.call(rbind, list_of_list))

However, the second option, i.e., the base R approach of as.data.frame + rbind, seems much faster than the first one (see the benchmark below):

library(data.table)
library(microbenchmark)

microbenchmark(
    f1 = rbindlist(rapply(list_of_list, list, how = "replace")),
    f2 = as.data.frame(do.call(rbind, list_of_list)),
    check = "equivalent"
)

which gives

Unit: microseconds
 expr   min    lq    mean median     uq   max neval
   f1 138.7 168.7 177.896 174.10 185.00 392.6   100
   f2  31.7  38.5  45.127  43.55  50.25  88.8   100
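
As a quick sanity check (not part of the original answer), one could confirm that both results keep each cell as a length-6 vector rather than flattening it:

f1_res <- rbindlist(rapply(list_of_list, list, how = "replace"))
f2_res <- as.data.frame(do.call(rbind, list_of_list))

str(f1_res$V1[[1]])  # expected: num [1:6] 4 4 1 0 1 0
str(f2_res$V1[[1]])  # expected: num [1:6] 4 4 1 0 1 0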
Onyambu

I would suggest you use Rcpp, as in the code below. Since you have 76 million items, I recommend running the data in batches, i.e. about 10 million at a time. On my computer it takes 8 seconds to convert 10 million items into a matrix, so doing this 8 times should take roughly 70-80 seconds. Store the separate matrices and then combine them into one, probably by writing them to a single file on disk (a sketch of such a batched run follows the example output below).

Rcpp::cppFunction(
'NumericVector combineList(std::vector< std::vector<std::vector<double>>> x){
    int n = x.size();        // number of top-level items (pairs)
    int m = x[0].size();     // vectors per item (here 2)
    int p = x[0][0].size();  // elements per vector (here 6)
    std::vector<double> y(n*p*m);
    // fill in column-major order: column j holds the j-th vector of every
    // item, each item contributing p consecutive rows
    for(int i = 0; i < n; i++)
        for(int j = 0; j < m; j++)
            for(int k = 0; k < p; k++)
                y[p * (i + n * j) + k] = x[i][j][k];
    NumericVector z = wrap(y);
    z.attr("dim") = Dimension(n*p, m);  // reshape to an (n*p) x m matrix
    return z;
}'
)

combineList(list_of_list)
       [,1] [,2]
  [1,]    4    3
  [2,]    4    3
  [3,]    1    2
  [4,]    0    2
  [5,]    1    0
  [6,]    0    0
  [7,]    4    3
  [8,]    4    4
  [9,]    1    3
 [10,]    0    1
 [11,]    1    0
 [12,]    0    0
 [13,]    4    4
 [14,]    4    5
 [15,]    1    1
 [16,]    0    0
 [17,]    1    0
 [18,]    0    1
 [19,]    4    5
 [20,]    4    8
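
A rough sketch of the batched workflow described above (the chunk size, output file, and use of fwrite are assumptions, not part of the original answer):

library(data.table)

# process the pairs in chunks of ~10 million and append each resulting
# matrix to a single CSV on disk
chunk_size <- 10e6
n <- length(list_of_list)

for (s in seq(1, n, by = chunk_size)) {
  idx <- s:min(s + chunk_size - 1, n)
  m <- combineList(list_of_list[idx])  # (6 * length(idx)) x 2 matrix
  fwrite(as.data.table(m), "combined.csv",
         append = (s > 1), col.names = FALSE)
}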
Sudoh

I was able to solve this issue fairly easily using this one line of code:

rbindlist(rapply(list_of_list, list, how = "replace"))

The fascinating part is that the above code processes all 76 million items in about two minutes, with no Rcpp required (though I can't say whether the packages use Rcpp under the hood).