Improve efficiency of bootstrap function

I have a large dataframe from which I want to draw random samples for each column. I'd like to sample multiple times and store the sums of the results in a new dataframe.

My dataframe looks like this:

library(microbenchmark)
library(plyr)
library(tidyverse)

set.seed(123)
df <- data.frame(matrix(sample(0:10, 1000 * 60, replace = TRUE), nrow = 1000, ncol = 60))

I have written a function to sample from my dataframe and calculate my statistics.

N <- nrow(df)
rd <- function(x) sample(x, size = N, replace = TRUE)

sampling <- function(df) {
  df_s <- apply(df, 2, rd)

  df_s %>%
    as.data.frame() %>%
    summarise_if(is.numeric, sum)
}

I'd like to replicate this 10,000 times and save the summary statistics in a new dataframe.

reps <- 10  # 10 for testing here; the goal is 10000
df_sums <- plyr::rdply(reps, sampling(df))

However, running this code just 100 times already seems very inefficient, and it takes slightly longer with my original dataset.

microbenchmark(sampling(df), times = 100)

Any suggestions on how I can make this more efficient so that I can actually run the code 10,000 times? I tried to write the function with replicate, but I couldn't get the output to look as neat as with rdply.
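
For reference, this is roughly the replicate version I tried (the bind_rows call is just one guess at how to tidy the output):

res <- replicate(reps, sampling(df), simplify = FALSE)  # list of one-row dataframes
df_sums2 <- dplyr::bind_rows(res)                       # stack into reps rows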

There are 2 solutions below.

Answer by Robert Hacken:

Maybe you don't need to resample single columns but can resample the whole data frame at once.

sampling2 <- function(df) {
  # resample rows with replacement, then sum each column
  colSums(df[sample(nrow(df), replace = TRUE), ])
}
df_sums <- t(replicate(100, sampling2(df)))

This works much faster:

microbenchmark(sampling(df),
               sampling2(df))
# Unit: milliseconds
#          expr       min        lq      mean    median        uq      max neval cld
#  sampling(df) 62.047601 64.451151 77.350142 69.629501 81.000350 476.3357   100   b
# sampling2(df)  1.427401  1.562552  1.954756  1.654052  1.906201  13.3865   100  a 

Please note that this approach somewhat breaks the independence of values within the rows of df_sums. If that is a problem, it can be solved by resampling the columns of df_sums:

df_sums <- apply(df_sums, 2, sample)
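
Putting it together for the full run (converting to a data frame at the end is an assumption, to mirror the rdply-style output):

reps <- 10000
df_sums <- t(replicate(reps, sampling2(df)))  # one row of column sums per replicate
df_sums <- apply(df_sums, 2, sample)          # shuffle each column to decouple the rows
df_sums <- as.data.frame(df_sums)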
Answer by jblood94:

Since you are sampling with replacement, you can trade memory for speed: 10k bootstrap replications times 1k rows = 10M samples of each column, so put them in a 10k-by-1k matrix and take the row sums. Running this on the example data frame takes about 30 seconds on my 8-year-old laptop.

set.seed(123)
df <- data.frame(matrix(sample(0:10, 6e4, replace = TRUE), 1e3, 60))

library(Rfast)

system.time(
  # for each column: draw all 1e4 bootstrap samples at once, reshape so
  # each of the 1e4 rows is one replicate, and sum each row
  sapply(df, \(x) rowsums(matrix(sample(x, 1e4 * length(x), replace = TRUE), 1e4)))
)
#>    user  system elapsed 
#>   27.12    1.06   30.47