I am using a SLURM cluster with Dask and don't quite understand the configuration part. The documentation talks of jobs and workers and even has a section on the difference:
In dask-distributed, a Worker is a Python object and node in a dask Cluster that serves two purposes, 1) serve data, and 2) perform computations. Jobs are resources submitted to, and managed by, the job queueing system (e.g. PBS, SGE, etc.). In dask-jobqueue, a single Job may include one or more Workers.
Problem is, I still don't get it. I use the word task to refer to a single function one submits using a client, i.e. with a client.submit(task, *params) call.
My understanding of how Dask works is that there are n_workers set up and that each task is submitted to a pool of said workers. Any worker works on one task at a given time, potentially using multiple threads and processes.
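For example, something along these lines (the function and arguments are just illustrative):

```python
from dask.distributed import Client

def task(x, y):
    # some function I want to run on the cluster
    return x + y

client = Client()  # in my case this points at a dask-jobqueue SLURMCluster
future = client.submit(task, 1, 2)  # submit one task to the pool of workers
result = future.result()
```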
However, my understanding does not leave any room for the term job and is thus certainly wrong. Moreover, most of the cluster configuration (cores, memory, processes) is done on a per-job basis according to the docs.
So my question is: what is a job? Can anyone explain, in simpler terms, its relation to a task and a worker? And how do the cores, memory, processes, and n_workers configurations interact? (I have read the docs, I just don't understand them and could use another explanation.)
Your understanding of tasks and workers is correct. A job is a concept specific to SLURM (and other HPC queueing systems where users submit jobs). A job consists of instructions for what to execute and what resources are needed, so the typical workflow of a SLURM user is to write a script and then submit it for execution using salloc or sbatch. A single job can be submitted with instructions to launch multiple dask workers (there might be advantages to this due to latency, permissions, resource availability, etc., but that would need to be determined from the particular cluster configuration).
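To get a feel for what such a job looks like, dask-jobqueue can print the batch script it generates for each job. This is only a minimal sketch; the resource values below are placeholders:

```python
from dask_jobqueue import SLURMCluster

# Each job submitted to SLURM will launch one dask worker
# with 8 cores and 32 GB of memory (placeholder values).
cluster = SLURMCluster(cores=8, processes=1, memory="32GB")

# Show the sbatch script that dask-jobqueue submits for each job.
print(cluster.job_script())
```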
From the dask perspective what matters is the number of workers, but from the dask-jobqueue perspective the number of jobs also matters. For example, if the number of workers per job is 2, then to get 10 workers in total dask-jobqueue will submit 5 jobs to the SLURM scheduler. The example below, adapted from the docs, will result in 10 dask workers, each with 24 cores.
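A minimal sketch of such a setup (the memory value is a placeholder; cores=24 with processes=1 is what gives one 24-core worker per job):

```python
from dask_jobqueue import SLURMCluster
from dask.distributed import Client

cluster = SLURMCluster(
    cores=24,         # cores per job
    processes=1,      # one worker per job, so each worker gets all 24 cores
    memory="500GB",   # memory per job (placeholder value)
)

# 10 jobs x 1 worker per job = 10 workers, each with 24 cores
cluster.scale(jobs=10)

client = Client(cluster)
```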
If we specify multiple processes, then the total number of workers will be jobs * processes (assuming sufficient cores), so a configuration like the one sketched below would give 100 workers with 2 cores each and 50 GB per worker (note that the memory in the config is the total per job, shared among its workers).
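One possible configuration along those lines (values are illustrative):

```python
from dask_jobqueue import SLURMCluster
from dask.distributed import Client

cluster = SLURMCluster(
    cores=4,          # cores per job
    processes=2,      # 2 workers per job -> 4 / 2 = 2 cores per worker
    memory="100GB",   # total memory per job -> 100 / 2 = 50 GB per worker
)

# 50 jobs x 2 workers per job = 100 workers
cluster.scale(jobs=50)

client = Client(cluster)
```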