I am using a SLURM cluster with Dask and don't quite understand the configuration part. The documentation talks of jobs and workers and even has a section on the difference:
In dask-distributed, a Worker is a Python object and node in a dask Cluster that serves two purposes, 1) serve data, and 2) perform computations. Jobs are resources submitted to, and managed by, the job queueing system (e.g. PBS, SGE, etc.). In dask-jobqueue, a single Job may include one or more Workers.
Problem is, I still don't get it. I use the word task to refer to a single function one submits using a client, i.e. with a client.submit(task, *params) call.
My understanding of how Dask works is that there are n_workers set up and that each task is submitted to a pool of said workers. Any worker works on one task at a given time, potentially using multiple threads and processes.
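For example, something along these lines (the function and arguments are just illustrative):

```python
from dask.distributed import Client

def task(x, y):
    # some function I want to run on the cluster
    return x + y

client = Client()  # in my case this points at a dask-jobqueue SLURMCluster
future = client.submit(task, 1, 2)  # submit one task to the pool of workers
result = future.result()
```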
However, my understanding does not leave any room for the term job and is thus certainly wrong. Moreover, most of the cluster configuration (cores, memory, processes) is done on a per-job basis according to the docs.
So my question is: what is a job? Can anyone explain, in simpler terms, its relation to a task and a worker? And how do the cores, memory, processes, and n_workers configurations interact? (I have read the docs, I just don't understand them and could use another explanation.)
Your understanding of tasks and workers is correct. A job is a concept specific to SLURM (and other HPC queueing systems where users submit jobs). A job consists of instructions for what to execute and what resources are needed, so the typical workflow of a SLURM user is to write a script and then submit it for execution using salloc or sbatch. A single job can be submitted with instructions to launch multiple dask workers (there might be advantages to this due to latency, permissions, resource availability, etc., but that would need to be determined from the particular cluster configuration).
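To get a feel for what such a job looks like, dask-jobqueue can print the batch script it generates for each job. This is only a minimal sketch; the resource values below are placeholders:

```python
from dask_jobqueue import SLURMCluster

# Each job submitted to SLURM will launch one dask worker
# with 8 cores and 32 GB of memory (placeholder values).
cluster = SLURMCluster(cores=8, processes=1, memory="32GB")

# Show the sbatch script that dask-jobqueue submits for each job.
print(cluster.job_script())
```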
From the dask perspective what matters is the number of workers, but from the dask-jobqueue perspective the number of jobs also matters. For example, if the number of workers per job is 2, then to get 10 workers in total dask-jobqueue will submit 5 jobs to the SLURM scheduler. The example below, adapted from the docs, will result in 10 dask workers, each with 24 cores.
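A minimal sketch of such a setup (the memory value is a placeholder; cores=24 with processes=1 is what gives one 24-core worker per job):

```python
from dask_jobqueue import SLURMCluster
from dask.distributed import Client

cluster = SLURMCluster(
    cores=24,         # cores per job
    processes=1,      # one worker per job, so each worker gets all 24 cores
    memory="500GB",   # memory per job (placeholder value)
)

# 10 jobs x 1 worker per job = 10 workers, each with 24 cores
cluster.scale(jobs=10)

client = Client(cluster)
```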
If we specify multiple processes, then the total number of workers will be jobs * processes (assuming sufficient cores), so a configuration like the one sketched below would give 100 workers with 2 cores each and 50 GB per worker (note that the memory in the config is the total per job, shared among its workers).
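One possible configuration along those lines (values are illustrative):

```python
from dask_jobqueue import SLURMCluster
from dask.distributed import Client

cluster = SLURMCluster(
    cores=4,          # cores per job
    processes=2,      # 2 workers per job -> 4 / 2 = 2 cores per worker
    memory="100GB",   # total memory per job -> 100 / 2 = 50 GB per worker
)

# 50 jobs x 2 workers per job = 100 workers
cluster.scale(jobs=50)

client = Client(cluster)
```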