I have a csh script that looks like this:
foreach n (`seq 1 1000000`)
./myprog${n}.x
end
I want to parallelize it and run it on my Slurm cluster. Because each instance of the program requires only one core, I want to use a node (or a few nodes) to run many instances at a time:
#!/bin/csh
#SBATCH --nodes=8
#SBATCH -n 1024
#SBATCH --ntasks-per-node=128
foreach n (`seq 1 1000000`)
srun -N 1 -n 1 ./myprog${n}.x &
end
wait
When I do this, it seems like it's only running one task at a time on a given node, although it's difficult to tell. Is there an option I can add to srun, or an #SBATCH header I can add, that will allow me to run on all of the cores I've requested?
How you do this can vary with the version of Slurm that is running. However, one example is given at:
https://docs.archer2.ac.uk/user-guide/scheduler/#example-4-256-serial-tasks-running-across-two-nodes
Note: this assumes you have exclusive node access. Essentially, you loop over the nodes assigned to the job and then loop over the tasks you want to place on each of them. An example job submission script along the lines of yours is sketched below (note: you will need to modify the --mem option to a value suitable for the total amount of memory available on the compute nodes you are using). This does not get through the 1,000,000 tasks you originally specified in one pass, but you should be able to come up with some arithmetic so that your total number of tasks is split across the number of nodes you are assigned (or you can set up a series of jobs that end up with exactly the right number of tasks per node).
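A minimal sketch following the ARCHER2 pattern linked above, assuming exclusive node access and a reasonably recent Slurm (the --exact flag, the --mem=1500M value, and the 128 tasks-per-node count are placeholders to adjust for your own cluster and Slurm version):

#!/bin/csh
#SBATCH --nodes=8
#SBATCH --ntasks-per-node=128
#SBATCH --exclusive

# Task counter across the whole job
set n = 1

# Loop over the nodes assigned to this job, then over the tasks
# to place on each node (128 here, matching --ntasks-per-node)
foreach node (`scontrol show hostnames $SLURM_JOB_NODELIST`)
    foreach i (`seq 1 128`)
        # --exact (or --exclusive on older Slurm) keeps each step to exactly
        # the resources it asks for, so steps can share a node; --mem must be
        # small enough that 128 steps fit in the node's memory at once
        srun --nodelist=$node --nodes=1 --ntasks=1 --exact --mem=1500M ./myprog${n}.x &
        @ n = $n + 1
    end
end

# Hold the job open until all background steps have finished
wait

Each srun launches a single one-core job step in the background, so this runs 8 x 128 = 1024 programs concurrently and the wait at the end keeps the job alive until they have all completed.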