Slurm + drake: free resources of idle job array workers for dynamic branching


EDIT: the question title and tags were adjusted after discovering that the described behavior does not stem from SLURM but from the R package {drake}, which is used as a proxy to execute SLURM array jobs.

I've got the following situation:

  • A Slurm job array of n = 70 with X CPUs and Y memory per job
  • 120 tasks to be run
  • Each task requires the same CPU and memory but takes a different amount of time to finish (see the sketch after this list)
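
For illustration, this workload shape can be sketched with the R package {clustermq}, which {drake} commonly uses as its SLURM backend (an assumption here, since the post does not name it). This is a minimal sketch, not the actual code: the task durations are made up, and the X CPU / Y memory values would be configured in the scheduler template:

  library(clustermq)
  options(clustermq.scheduler = "slurm")

  # 120 tasks with equal resource needs but varying runtimes (made-up durations)
  durations <- sample(10:60, size = 120, replace = TRUE)

  # 70 SLURM array workers pull tasks until none remain
  Q(function(t) Sys.sleep(t), t = durations, n_jobs = 70)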

This leads to the following situation:

When tasks 71-120 are running (after tasks 1-70 have completed), I have 50 active workers and 20 idle workers. The idle workers will not receive any more work and just wait for the active workers to complete.

Over time, more and more workers finish, and at some point I have 5 active workers and 65 idle ones. Let's assume that the last 5 tasks take quite some time to complete. During this time, the idle workers block resources on the cluster and constantly print the following to their respective log files:

2021-04-03 19:41:41.866282 | > WORKER_WAIT (0.000s wait)
2021-04-03 19:41:41.868709 | waiting 1.70s
2021-04-03 19:41:43.571948 | > WORKER_WAIT (0.000s wait)

[...]

Is there a way to shut down these idle workers and free their resources once there are no more tasks left to allocate to them? Currently, they wait until all workers are done and only then release their resources.


1 Answer

Answered by pat-s:

Thanks to the comment by @Michael Schubert, I've found that this behavior occurs when using the R package {drake} and its dynamic branching feature (workers for static targets shut down just fine).

Here, a "target" can have dynamic "subtargets", which can be computed as separate array jobs via SLURM. These subtargets are combined after all of them have been computed. Until this aggregation step has happened, all workers remain in a "waiting" state in which they print the WORKER_WAIT status shown above.

Wild guess: this might not be avoidable given the design of dynamic targets in {drake}, because to aggregate all subtargets, they need to exist first. Hence, individual subtargets must be kept in a temporary state until all of them are available.

The following {drake} R code can be used in conjunction with a SLURM cluster to reproduce the behavior described above:

  library(drake)

  plan <- drake_plan(
    list_time = c(30, 60),
    test_dynamic = target(
      Sys.sleep(time = list_time),  # each subtarget sleeps for one element
      dynamic = map(list_time)
    )
  )
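
To launch the workers on a SLURM cluster, the plan can then be executed with {drake}'s clustermq backend. A minimal sketch, assuming {clustermq} is installed and a SLURM template file exists; the file name "slurm_clustermq.tmpl" is a placeholder, not from the original post:

  options(
    clustermq.scheduler = "slurm",
    clustermq.template = "slurm_clustermq.tmpl"  # placeholder template file
  )

  # one worker per subtarget: the 30s worker enters WORKER_WAIT while the
  # 60s worker runs, and both release resources only after aggregation
  make(plan, parallelism = "clustermq", jobs = 2)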