I want to rowbind multiple dataframes, each with a varying number of rows. I understand that using a for loop in which something like a final_df is overwritten in each iteration is very slow, because R has to make a copy of the object on every alteration. Usually this problem can be solved by pre-allocating the correct number of rows and columns in the dataframe and then modifying it in place on each iteration of the loop. However, in my situation this is trickier, because each individual dataframe may have a different number of rows than the previous one. (In my actual code I am dealing with a long list of XML files, from which I am extracting certain bits of information. Depending on the file, I can end up with more or fewer rows.)
My attempt so far is to use dplyr::bind_rows() or data.table::rbindlist(), which perform similarly well and both outclass do.call("rbind", ...) by a lot. However, I notice that even with these approaches, the computation time still grows nonlinearly as I increase the number of dataframes.
Do you have suggestions on how I can further improve the speed of my code? Thanks a lot in advance!
# Simulate one extracted table: 10 columns, a random 12-15 rows
create_df <- function() {
  nrow <- sample(12:15, 1)
  ncol <- 10
  randomdf <- matrix(rnorm(nrow * ncol), nrow = nrow, ncol = ncol) |> data.frame()
  return(randomdf)
}
# Approach 1: grow final_df with rbind() inside the loop (copies on every iteration)
approach1 <- function(n) {
  final_df <<- data.frame()
  for (i in 1:n) {
    current_df <- create_df()
    final_df <<- rbind(final_df, current_df)
  }
}
# Approach 2: fill a pre-allocated list, then bind once with do.call("rbind")
approach2 <- function(n) {
  df_list <<- vector("list", n)
  for (i in 1:n) {
    df_list[[i]] <<- create_df()
  }
  final_df <<- do.call("rbind", df_list)
}
# Approach 3: fill a pre-allocated list, then bind once with dplyr::bind_rows()
approach3 <- function(n) {
  df_list <<- vector("list", n)
  for (i in 1:n) {
    df_list[[i]] <<- create_df()
  }
  final_df <<- dplyr::bind_rows(df_list)
}
# Approach 4: fill a pre-allocated list, then bind once with data.table::rbindlist()
approach4 <- function(n) {
  df_list <<- vector("list", n)
  for (i in 1:n) {
    df_list[[i]] <<- create_df()
  }
  final_df <<- data.table::rbindlist(df_list)
}
microbenchmark::microbenchmark(
  approach1(5),
  approach2(5),
  approach3(5),
  approach4(5),
  approach1(50),
  approach2(50),
  approach3(50),
  approach4(50),
  approach1(500),
  approach2(500),
  approach3(500),
  approach4(500),
  times = 10
)
Unit: microseconds
           expr      min       lq      mean    median       uq      max neval
   approach1(5)   1173.5   1201.1   1317.12   1285.30   1402.2   1557.0    10
   approach2(5)    771.6    781.8   1121.18    829.15    944.6   3573.1    10
   approach3(5)    543.7    613.4    966.10    672.15    952.4   3131.8    10
   approach4(5)    520.8    586.5    641.18    621.65    663.8    818.8    10
  approach1(50)  12186.9  12381.4  13932.40  12760.10  14518.8  18537.4    10
  approach2(50)   6497.6   6766.0   7160.26   6967.55   7230.3   8390.6    10
  approach3(50)   3681.3   4143.1   4258.44   4233.10   4347.8   5022.8    10
  approach4(50)   3806.7   3821.8   4166.71   3962.95   4190.6   5900.4    10
 approach1(500) 275530.0 285779.1 326732.16 294302.30 304461.0 622130.3    10
 approach2(500)  65243.8  67456.7  72789.76  74422.30  77063.0  79485.0    10
 approach3(500)  38600.0  39328.4  41372.67  41215.80  42345.2  47488.8    10
 approach4(500)  32496.5  36788.1  41160.35  39940.10  46043.2  49752.9    10
approach3 and approach4 are spending most of their time in create_df, so you're not getting a good idea of the speed of the binding operation. Better to time just the binding:
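Something along these lines (a sketch, reusing your create_df; the list is built once up front so that only the bind itself is timed):

# Build the list outside the benchmark so only the binding is measured
df_list <- replicate(500, create_df(), simplify = FALSE)

microbenchmark::microbenchmark(
  do.call("rbind", df_list),
  dplyr::bind_rows(df_list),
  data.table::rbindlist(df_list),
  times = 10
)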
Now it is clear that rbindlist is the fastest for larger lists of tables. If your process is taking a long time, the binding operation probably isn't the first place I would look. If you know the table columns all line up, you can squeeze a little more performance out of rbindlist by setting the use.names argument to FALSE.
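For example (safe here, because create_df always yields the same columns in the same order):

# Bind columns by position instead of matching them by name
final_df <- data.table::rbindlist(df_list, use.names = FALSE)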