I have a large dataframe from which I want to take random samples for each column. I'd like to sample multiple times and store the sums of my results in a new dataframe.
My dataframe looks like this:
library(microbenchmark)
library(plyr)
library(tidyverse)
set.seed(123)
df <- data.frame(matrix(sample(0:10, 1000 * 60, replace = TRUE), nrow = 1000, ncol = 60))
I have written a function to sample from my dataframe and calculate my statistics.
N <- nrow(df)
rd <- function(x) sample(x, size = N, replace = TRUE)
sampling <- function(df) {
  # resample each column with replacement, then sum every column
  df_s <- apply(df, 2, rd)
  df_s %>%
    as.data.frame() %>%
    summarise_if(is.numeric, sum)
}
I'd like to replicate this 10,000 times and save the summary statistics in a new dataframe.
reps <- 10  # small test value; the goal is 10,000 replicates
df_sums <- plyr::rdply(reps, sampling(df))
However, running this code just 100 times already seems very inefficient, and it takes even longer with my original dataset.
microbenchmark(sampling(df), times = 100)
Any suggestions on how I can make this more efficient so that I can actually run my code 10,000 times? I tried to write the function with replicate, but I couldn't get the output to look as neat as with rdply.
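For reference, a rough sketch of the replicate approach I mean (not my exact code):
df_list <- replicate(reps, sampling(df), simplify = FALSE)
# this returns a list of one-row data frames that still has to be bound
# together by hand and lacks the .n replicate index that rdply adds
df_sums_rep <- dplyr::bind_rows(df_list)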
Maybe you don't need to resample single columns but can resample the whole data frame at once.
This works much faster:
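A sketch of that idea (sampling2 is just an illustrative name): draw nrow(df) row indices with replacement once per replicate and sum every column of the resampled data frame, instead of sampling each column separately.
sampling2 <- function(df) {
  # resample whole rows with replacement, then sum every column
  df[sample(nrow(df), replace = TRUE), ] %>%
    summarise_if(is.numeric, sum)
}
df_sums <- plyr::rdply(reps, sampling2(df))
microbenchmark(sampling2(df), times = 100)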
Please note that this approach somewhat breaks the independence of values within the rows of df_sums. If that is a problem, it can be solved by resampling the columns of df_sums:
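One way to read that (a minimal sketch): shuffle each summary column of df_sums independently, so the values in a row no longer come from the same set of resampled rows. The .n replicate index that rdply puts in the first column is left untouched.
# permute every column except the .n index independently
df_sums[-1] <- lapply(df_sums[-1], sample)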