I'm trying to read multiple large CSV files using nested parallelism with future.
I have a single machine with 32 cores, and I want to set up nested parallelism (5 by 6): 5 outer processes with 6 cores each. I'm trying to utilize the implicit parallelism of data.table::fread(..., nThread = 6).
The R package future provides nested parallelism, and I've tried:
library(future)
plan(list(tweak(multisession, workers = 5), tweak(multisession, workers = 6)))
but the above actually uses only 1 core in each subprocess:
library(doFuture)  ## also attaches foreach and future
plan(list(tweak(multisession, workers = 5),
          tweak(multisession, workers = 6)))
registerDoFuture()
foreach(i = 1:5) %dopar% {
  availableCores()
}
[[1]]
mc.cores
1
[[2]]
mc.cores
1
[[3]]
mc.cores
1
[[4]]
mc.cores
1
[[5]]
mc.cores
1
Is there a way to achieve this?
(Futureverse maintainer here)
I see the misunderstanding here. You want to use nbrOfWorkers() (from future) here instead of availableCores() (from parallelly; re-exported as-is by future). This will give you what you expected:
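For example, a minimal sketch using the same setup as in the question:

library(doFuture)
plan(list(tweak(multisession, workers = 5),
          tweak(multisession, workers = 6)))
registerDoFuture()
foreach(i = 1:5) %dopar% {
  nbrOfWorkers()  ## reports 6 in each of the five outer workers
}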
The reason availableCores() returns one (1) is that the future framework tries to prevent nested parallelization by mistake. It does this by setting options and environment variables that control the number of parallel workers and CPU cores, including options(mc.cores = 1L), which availableCores() correctly picks up. This prevents, for instance, a package that uses y <- mclapply(X, FUN), cl <- makeCluster(availableCores()), or plan(multisession) from running in parallel when it is already running inside a parallel worker. In contrast, nbrOfWorkers() reflects the number of workers specified by plan(). In your case, plan(multisession, workers = 6) is set in the parallel workers, coming from the second level of plan(list(tweak(multisession, workers = 5), tweak(multisession, workers = 6))).

To convince yourself that you are indeed running in parallel with your setup, you can adapt one of the examples in https://future.futureverse.org/articles/future-3-topologies.html.
Now, parallel threads are not the same as parallel processes (aka parallel workers). You can think of threads as a much lower-level parallelization mechanism. Importantly, the future framework does not constrain the number of threads used in parallel workers, including the number of parallel threads that data.table uses. Because of this, you need to explicitly call, inside each worker, something like:
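data.table::setDTthreads(6)  ## six threads per worker, matching the 5-by-6 setup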
or, if you want to be agile to the current settings,
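data.table::setDTthreads(nbrOfWorkers())  ## the per-worker allotment of the current plan (here 6)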
to avoid over-parallelization. Alternatively, you can reconfigure data.table as:
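One possibility is data.table's R_DATATABLE_NUM_THREADS environment variable; a sketch, assuming it is set before the worker loads data.table, since the variable is consulted when the package is loaded:

Sys.setenv(R_DATATABLE_NUM_THREADS = 6)  ## picked up when data.table loads in the worker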
BTW, in doFuture (>= 1.0.0), you no longer need registerDoFuture() if you replace %dopar% with %dofuture%. So, the gist of reading lots of CSV files in parallel is:
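A minimal sketch of that gist; the file listing via dir() is an assumption, so adjust the pattern to your files:

library(doFuture)
plan(list(tweak(multisession, workers = 5),
          tweak(multisession, workers = 6)))

files <- dir(pattern = "[.]csv$")
res <- foreach(file = files) %dofuture% {
  data.table::setDTthreads(nbrOfWorkers())  ## six threads in each worker
  data.table::fread(file)
}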
With all that said, note that your bottleneck will probably be the file system rather than the CPU. When you parallelize reading files, you might overwhelm the file system and end up slowing down the file reading rather than speeding it up. Sometimes it is faster to read two or three files in parallel, but with more it becomes counterproductive. So, you need to benchmark with different numbers of parallel workers.
Moreover, these days there are R packages that are highly specialized for reading data files into R efficiently. Some of them support reading multiple files efficiently; the vroom package is one such example.
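For example, a sketch assuming all files share the same columns, since vroom() row-binds multiple inputs:

library(vroom)
files <- dir(pattern = "[.]csv$")  ## assumption: your CSV files
res <- vroom(files)  ## reads and row-binds all files in one call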