I'm trying to optimize a highly parallel, memory-intensive `targets` pipeline. I'm noticing that the wall clock time for downstream dynamic branch targets is much longer than the reported execution time for the same targets. Example:

```
● built branch PSUT_Re_all_Chop_all_Ds_all_Gr_all_f29c72e5 [11.05 seconds]
```

Wall clock time: 20.07 seconds.
To speed up the pipeline, I would like to reduce the discrepancy between wall clock time and execution time, if possible. But what could be causing this discrepancy?
Background:
- The input data for each branch target (e.g., `_f29c72e5`) is created dynamically from rows of a (much) larger upstream data frame target.
- I set `storage = "worker"` and `retrieval = "worker"`, as suggested for highly parallel pipelines at https://books.ropensci.org/targets/performance.html.
- I set `memory = "transient"` and `garbage_collection = TRUE`, as suggested for high-memory pipelines at https://books.ropensci.org/targets/performance.html. (See the sketch of these options after this list.)
- The entire upstream (input) data frame takes about 8 seconds to read from disk with `tar_read()` in an interactive session, which is nearly the full discrepancy between wall clock time and execution time.
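For reference, the relevant options in my `_targets.R` look roughly like this (a simplified sketch, not my actual file):

```r
# Simplified sketch of the relevant options in _targets.R
library(targets)

tar_option_set(
  storage = "worker",        # workers save their own return values to the store
  retrieval = "worker",      # workers load their own dependencies
  memory = "transient",      # drop each target from memory after it is used
  garbage_collection = TRUE  # run gc() before each target builds
)
```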
Thus, my working theory is that each dynamically created downstream branch loads the entire upstream target, slices it, and then passes the relevant slice to the branch target's function.
Is that theory plausible? If so, I will create an example project and post a follow-up question about how to solve the problem.
Thanks in advance for insights.
There are a couple of things you could try. One is to profile the pipeline and look at the flame graph to see what is slowing things down. You may want to run this in the terminal instead of RStudio, because the latter sometimes has a strange interaction with `proffer` and `targets` together.
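For example, a minimal profiling run might look like this sketch; `callr_function = NULL` keeps `tar_make()` in the current R session so the profiler can see the pipeline code:

```r
# Sketch: profile the pipeline and open a flame graph with proffer.
# callr_function = NULL runs tar_make() in the current R session so the
# profiler can capture it.
library(targets)
proffer::pprof(tar_make(callr_function = NULL))
```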
If reading that dataset really is the bottleneck, you could set up your pipeline like this:
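(A sketch only; `read_big_data()`, the grouping column `group_id`, and `analyze_slice()` below are placeholders for your own reader, grouping variable, and analysis function.)

```r
# _targets.R sketch: slice the big dataset once, upstream, so each
# dynamic branch only loads its own slice from the data store.
library(targets)
tar_option_set(packages = "dplyr")  # for group_by() inside the targets below

list(
  # Read the full data frame once and tag rows by branch with tar_group().
  tar_target(
    big_data,
    read_big_data() %>%       # hypothetical reader for your dataset
      group_by(group_id) %>%  # hypothetical grouping column
      tar_group(),
    iteration = "group"
  ),
  # One branch per group: each data_slice branch stores only its own rows.
  tar_target(data_slice, big_data, pattern = map(big_data)),
  # Downstream branches load one slice each, never the whole dataset.
  tar_target(
    result,
    analyze_slice(data_slice),  # hypothetical per-slice analysis
    pattern = map(data_slice)
  )
)
```

With this layout, each downstream branch only loads its own `data_slice` branch from the store, so the 8-second read of the full dataset should only happen when `big_data` or `data_slice` themselves need to rebuild.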
The first time you run the pipeline, you could call `tar_make(data_slice)` to build all the slices locally while keeping the big dataset in memory. (If you are using `crew`, I recommend commenting out the controller at this step.) Then, if `data_slice` is all up to date, you could run a second `tar_make()` (or e.g. `tar_make_clustermq()`) to run the rest of the targets. At this second `tar_make()`, `big_data` and `data_slice` are already up to date, so the full dataset should not need to load at all.
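In other words, roughly (assuming the sketch above):

```r
# Step 1: build the slices in the local session while big_data stays in
# memory (comment out the crew controller in _targets.R for this step).
targets::tar_make(data_slice)

# Step 2: with big_data and data_slice up to date, run the remaining
# targets on parallel workers; the full dataset no longer needs to load.
targets::tar_make()
```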
Alternatively, you could try setting `memory = "persistent"` just for that upstream data target while you are building branches.
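For example (a sketch, reusing the hypothetical `read_big_data()` from above; the per-target `memory` argument overrides the global option for just this target):

```r
# Sketch: keep the big upstream dataset in memory across branches,
# overriding the global memory = "transient" setting for this one target.
tar_target(
  big_data,
  read_big_data(),       # hypothetical reader
  memory = "persistent"
)
```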