I'm encountering an issue while configuring Slurm in my distributed computing environment. When I launch a process that should only use 4 cores, it ends up blocking all 128 available cores on the node, preventing me from using them for other tasks.
My submission script requests resources with directives such as `--nodes`, `--ntasks`, and `--cpus-per-task`. Despite this, the job occupies all cores on the node instead of adhering to the requested allocation.
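For reference, a minimal sketch of the kind of submission script I am using; the job name, time limit, and executable below are placeholders, not my exact values:

```bash
#!/bin/bash
#SBATCH --nodes=1             # one node
#SBATCH --ntasks=1            # a single task
#SBATCH --cpus-per-task=4     # the task should get 4 cores
#SBATCH --time=01:00:00       # placeholder time limit
#SBATCH --job-name=small-job  # placeholder name

srun ./my_program             # placeholder for the actual workload
```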
Any ideas on why this might be happening or any additional configuration I should be mindful of to prevent a process from occupying all the cores on the node?
I appreciate any guidance or suggestions to resolve this issue. Thank you!
The two most common reasons for this are that (a) you implicitly or unknowingly request all resources of a compute node, or (b) the cluster is configured not to share compute nodes.
Regarding (a), the memory requirement is often the culprit. If the cluster or the partition is configured with `DefMemPerCPU` or `DefMemPerNode` and you do not override it in your submission script, you will prevent other jobs from using the node. Also make sure your environment does not contain variables that influence the resource allocation, e.g. `$SBATCH_EXCLUSIVE`.
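As a sketch of how to check this (the partition name below is a placeholder), you can inspect the active configuration with `scontrol` and look for Slurm input variables in your environment:

```bash
# Show the cluster-wide memory defaults (DefMemPerCPU / DefMemPerNode)
scontrol show config | grep -i defmem

# Show the defaults of a specific partition; replace "mypartition" with yours
scontrol show partition mypartition | grep -i defmem

# Look for Slurm input environment variables that could widen the request,
# e.g. SBATCH_EXCLUSIVE
env | grep '^SBATCH_'
```

If a large default is in effect, explicitly requesting a smaller amount in the submission script, e.g. `#SBATCH --mem-per-cpu=2G` (value illustrative), keeps the job from claiming the whole node's memory.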
Regarding (b), check that `SelectType` is not `select/linear` and that the partition configuration does not have `OverSubscribe=EXCLUSIVE`. If either is set, nodes are not shareable between jobs.
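A quick way to verify both settings (again, the partition name is a placeholder):

```bash
# select/linear allocates whole nodes; select/cons_tres (or cons_res) allows sharing
scontrol show config | grep -i selecttype

# OverSubscribe=EXCLUSIVE forces whole-node allocation for this partition
scontrol show partition mypartition | grep -i oversubscribe
```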