I am using mpi4py and submitting a job to Slurm with the following batch script:
#!/bin/bash
#SBATCH -N 2
#SBATCH --ntasks-per-core=2
module load --auto python/3.9.15-gcc-12.2.0-3sr5utz
module load --auto py-pandas/1.5.1-gcc-12.2.0-356d2ew
module load --auto py-scipy/1.8.1-gcc-12.2.0-7uvxgvy
module load --auto py-joblib/1.2.0-gcc-12.2.0-ecughwi
module load --auto py-mpi4py/3.1.3-gcc-12.2.0-xvabib2
mpirun --use-hwthread-cpus --np 1 python3 -u ./master.py 510
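The contents of master.py are not shown here. As a rough illustration only, a master that reads the child count from the command line and spawns that many workers might look like the sketch below. It assumes the spawn goes through `MPI.COMM_SELF.Spawn`; the helper names and the `worker.py` script are hypothetical and not taken from the actual code.

```python
import sys


def parse_child_count(argv):
    """Return the number of children requested on the command line."""
    if len(argv) != 2:
        raise SystemExit(f"usage: {argv[0]} N_CHILDREN")
    n_children = int(argv[1])
    if n_children < 1:
        raise SystemExit("N_CHILDREN must be a positive integer")
    return n_children


def spawn_children(n_children, worker_script="worker.py"):
    """Spawn n_children Python workers and return the intercommunicator.

    worker.py is a hypothetical script name used for illustration.
    """
    # Imported lazily so the argument parsing can be checked without an MPI runtime.
    from mpi4py import MPI

    # Each spawned process needs a slot from the MPI runtime,
    # which is where a "not enough slots" error would be raised.
    return MPI.COMM_SELF.Spawn(
        sys.executable, args=[worker_script], maxprocs=n_children
    )


def main(argv):
    n_children = parse_child_count(argv)
    intercomm = spawn_children(n_children)
    # ... exchange work with the children over intercomm here ...
    intercomm.Disconnect()
```

With a script like this, `main(sys.argv)` would be called at the bottom of the file, and each worker would obtain the same intercommunicator through `MPI.Comm.Get_parent()` and disconnect from it as well.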
I want to start 1 master process and have it spawn 510 child processes. Each node has 128 physical cores with 2 hardware threads per core. Having requested 2 nodes, I expected to get enough slots to spawn 510 child processes, but I got this error:
There are not enough slots available in the system to satisfy the 510
slots that were requested by the application
How can I use all the logical cores here without using --oversubscribe?
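The arithmetic behind that expectation can be written out as a quick check. This is only a sketch of the counting; the function names are made up for illustration, and which count the runtime actually uses as its slot total depends on how Slurm and the MPI launcher are configured, which is not known here.

```python
def logical_cpus(nodes, cores_per_node, threads_per_core):
    """Total hardware threads across the allocation."""
    return nodes * cores_per_node * threads_per_core


def slots_needed(children, masters=1):
    """One slot per process: the master plus every spawned child."""
    return masters + children


# Numbers from the question: 2 nodes, 128 cores each, 2 threads per core.
available_threads = logical_cpus(2, 128, 2)   # counting hardware threads
available_cores = logical_cpus(2, 128, 1)     # counting physical cores only
needed = slots_needed(510)

print("hardware threads:", available_threads)
print("physical cores:  ", available_cores)
print("slots needed:    ", needed)
print("fits on threads: ", needed <= available_threads)
print("fits on cores:   ", needed <= available_cores)
```

Counted as hardware threads, the request fits with one slot to spare; counted as physical cores only, it does not.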
Edit
If I use --oversubscribe, I get the following error:
[1706365618.473220] [n3511-017:303072:0] wireup.c:400 UCX ERROR ep 0x14b4ac04a180: no remote ep address for lane[2]->remote_lane[2]
[1706365618.473256] [n3511-017:303072:0] wireup.c:400 UCX ERROR ep 0x14b4ac04a080: no remote ep address for lane[2]->remote_lane[2]
[1706365618.473469] [n3511-017:303072:0] wireup.c:400 UCX ERROR ep 0x14b4ac04a100: no remote ep address for lane[2]->remote_lane[2]
[1706365618.473480] [n3511-017:303072:0] wireup.c:400 UCX ERROR ep 0x14b4ac04a240: no remote ep address for lane[2]->remote_lane[2]
[1706365618.480337] [n3511-017:303072:0] wireup.c:400 UCX ERROR ep 0x14b4ac04a480: no remote ep address for lane[2]->remote_lane[2]
[1706365618.480349] [n3511-017:303072:0] wireup.c:400 UCX ERROR ep 0x14b4ac04a040: no remote ep address for lane[2]->remote_lane[2]
[1706365618.480359] [n3511-017:303072:0] wireup.c:400 UCX ERROR ep 0x14b4ac04a540: no remote ep address for lane[2]->remote_lane[2]
[1706365618.480605] [n3511-029:76140:a] wireup.c:1071 UCX ERROR old: am_lane 0 wireup_msg_lane <none> cm_lane <none> keepalive_lane <none> reachable_mds 0x12
[1706365618.480649] [n3511-029:76140:a] wireup.c:1094 UCX ERROR old: lane[0]: 6:dc_mlx5/mlx5_0:1.0 md[4] -> md[4]/ib/sysdev[255] rma_bw#0 am am_bw#0
[1706365618.480659] [n3511-029:76140:a] wireup.c:1071 UCX ERROR new: am_lane 0 wireup_msg_lane 0 cm_lane <none> keepalive_lane <none> reachable_mds 0x12
[1706365618.480668] [n3511-029:76140:a] wireup.c:1094 UCX ERROR new: lane[0]: 10:ud_mlx5/mlx5_0:1.0 md[4] -> md[4]/ib/sysdev[255] am am_bw#0 wireup
[n3511-029:76140:a:79452] wireup.c:1384 Fatal: endpoint reconfiguration not supported yet
[1706365618.481141] [n3511-029:76141:a] wireup.c:1071 UCX ERROR old: am_lane 0 wireup_msg_lane <none> cm_lane <none> keepalive_lane <none> reachable_mds 0x12
[1706365618.481172] [n3511-029:76141:a] wireup.c:1094 UCX ERROR old: lane[0]: 6:dc_mlx5/mlx5_0:1.0 md[4] -> md[4]/ib/sysdev[255] rma_bw#0 am am_bw#0
[1706365618.481182] [n3511-029:76141:a] wireup.c:1071 UCX ERROR new: am_lane 0 wireup_msg_lane 0 cm_lane <none> keepalive_lane <none> reachable_mds 0x12
[1706365618.481193] [n3511-029:76141:a] wireup.c:1094 UCX ERROR new: lane[0]: 10:ud_mlx5/mlx5_0:1.0 md[4] -> md[4]/ib/sysdev[255] am am_bw#0 wireup
[n3511-029:76141:a:78916] wireup.c:1384 Fatal: endpoint reconfiguration not supported yet
==== backtrace (tid: 79452) ====
0 /gpfs/opt/sw/zen/spack-0.19.0/opt/spack/linux-almalinux8-zen3/gcc-12.2.0/ucx-1.13.1-p4m2lkzrqib6jwxuqkqdocfmyka7bvoh/lib/libucs.so.0(ucs_handle_error+0x294) [0x14c65c229e94]
1 /gpfs/opt/sw/zen/spack-0.19.0/opt/spack/linux-almalinux8-zen3/gcc-12.2.0/ucx-1.13.1-p4m2lkzrqib6jwxuqkqdocfmyka7bvoh/lib/libucs.so.0(ucs_fatal_error_message+0xca) [0x14c65c22700a]
2 /gpfs/opt/sw/zen/spack-0.19.0/opt/spack/linux-almalinux8-zen3/gcc-12.2.0/ucx-1.13.1-p4m2lkzrqib6jwxuqkqdocfmyka7bvoh/lib/libucs.so.0(+0x2c0e1) [0x14c65c2270e1]
3 /gpfs/opt/sw/zen/spack-0.19.0/opt/spack/linux-almalinux8-zen3/gcc-12.2.0/ucx-1.13.1-p4m2lkzrqib6jwxuqkqdocfmyka7bvoh/lib/libucp.so.0(ucp_wireup_init_lanes+0xf77) [0x14c65c93d047]
4 /gpfs/opt/sw/zen/spack-0.19.0/opt/spack/linux-almalinux8-zen3/gcc-12.2.0/ucx-1.13.1-p4m2lkzrqib6jwxuqkqdocfmyka7bvoh/lib/libucp.so.0(+0x90359) [0x14c65c93d359]
5 /gpfs/opt/sw/zen/spack-0.19.0/opt/spack/linux-almalinux8-zen3/gcc-12.2.0/ucx-1.13.1-p4m2lkzrqib6jwxuqkqdocfmyka7bvoh/lib/libucp.so.0(+0x90be1) [0x14c65c93dbe1]
6 /gpfs/opt/sw/zen/spack-0.19.0/opt/spack/linux-almalinux8-zen3/gcc-12.2.0/ucx-1.13.1-p4m2lkzrqib6jwxuqkqdocfmyka7bvoh/lib/ucx/libuct_ib.so.0(uct_ud_ep_process_rx+0x1fa) [0x14c6581ec29a]
7 /gpfs/opt/sw/zen/spack-0.19.0/opt/spack/linux-almalinux8-zen3/gcc-12.2.0/ucx-1.13.1-p4m2lkzrqib6jwxuqkqdocfmyka7bvoh/lib/ucx/libuct_ib.so.0(+0x631cb) [0x14c6581f51cb]
8 /gpfs/opt/sw/zen/spack-0.19.0/opt/spack/linux-almalinux8-zen3/gcc-12.2.0/ucx-1.13.1-p4m2lkzrqib6jwxuqkqdocfmyka7bvoh/lib/ucx/libuct_ib.so.0(+0x5503b) [0x14c6581e703b]
9 /gpfs/opt/sw/zen/spack-0.19.0/opt/spack/linux-almalinux8-zen3/gcc-12.2.0/ucx-1.13.1-p4m2lkzrqib6jwxuqkqdocfmyka7bvoh/lib/libucs.so.0(+0x181dc) [0x14c65c2131dc]
10 /gpfs/opt/sw/zen/spack-0.19.0/opt/spack/linux-almalinux8-zen3/gcc-12.2.0/ucx-1.13.1-p4m2lkzrqib6jwxuqkqdocfmyka7bvoh/lib/libucs.so.0(ucs_async_dispatch_handlers+0x4c) [0x14c65c213e8c]
11 /gpfs/opt/sw/zen/spack-0.19.0/opt/spack/linux-almalinux8-zen3/gcc-12.2.0/ucx-1.13.1-p4m2lkzrqib6jwxuqkqdocfmyka7bvoh/lib/libucs.so.0(+0x1b9a6) [0x14c65c2169a6]
12 /gpfs/opt/sw/zen/spack-0.19.0/opt/spack/linux-almalinux8-zen3/gcc-12.2.0/ucx-1.13.1-p4m2lkzrqib6jwxuqkqdocfmyka7bvoh/lib/libucs.so.0(ucs_event_set_wait+0xa9) [0x14c65c2333e9]
13 /gpfs/opt/sw/zen/spack-0.19.0/opt/spack/linux-almalinux8-zen3/gcc-12.2.0/ucx-1.13.1-p4m2lkzrqib6jwxuqkqdocfmyka7bvoh/lib/libucs.so.0(+0x1bff7) [0x14c65c216ff7]
14 /lib64/libpthread.so.0(+0x81ca) [0x14c6715351ca]
15 /lib64/libc.so.6(clone+0x43) [0x14c670a17e73]