Slurm throwing a PMIx error in a cluster (nodes running Proxmox VE)

I am trying to set up a 3-node cluster using Slurm, following this tutorial: https://github.com/SergioMEV/slurm-for-dummies

After successfully installing MUNGE and generating the slurm.conf file with the Slurm configurator tool (https://slurm.schedmd.com/configurator.html), I am stuck at this point:

OS: Proxmox VE 8.1.3 x86_64

Output of 'systemctl status slurmctld.service':

● slurmctld.service - Slurm controller daemon
     Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled; preset: enabled)
     Active: active (running) since Thu 2024-01-11 15:10:45 IST; 7min ago
       Docs: man:slurmctld(8)
   Main PID: 3484 (slurmctld)
      Tasks: 6
     Memory: 3.1M
        CPU: 57ms
     CGroup: /system.slice/slurmctld.service
             ├─3484 /usr/sbin/slurmctld -D -s
             └─3488 "slurmctld: slurmscriptd"

Jan 11 15:10:45 server2 systemd[1]: Started slurmctld.service - Slurm controller daemon.
Jan 11 15:10:46 server2 slurmctld[3484]: slurmctld: slurmctld version 22.05.8 started on cluster dlabcluster
Jan 11 15:10:46 server2 slurmctld[3484]: slurmctld: error:  mpi/pmix_v4: init: (null) [0]: mpi_pmix.c:195: pmi/pmix: can not load PMIx library
Jan 11 15:10:48 server2 slurmctld[3484]: slurmctld: error: Invalid RPC received REQUEST_TRIGGER_PULL while in standby mode
Jan 11 15:10:48 server2 slurmctld[3484]: slurmctld: error: Invalid RPC received MESSAGE_NODE_REGISTRATION_STATUS while in standby mode

Contents of slurm.conf:

ClusterName=DlabCluster
SlurmctldHost=server1
SlurmctldHost=server2
ProctrackType=proctrack/linuxproc
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=root
StateSaveLocation=/var/spool/slurmctld
TaskPlugin=task/affinity,task/cgroup
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
SchedulerType=sched/backfill
SelectType=select/cons_tres
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurmd.log
NodeName=server[1-3] RealMemory=128636 Sockets=1 CoresPerSocket=64 ThreadsPerCore=2 State=UNKNOWN
PartitionName=mainPartition Nodes=ALL Default=YES MaxTime=INFINITE State=UP
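
For reference, here is a minimal sketch of how the two SlurmctldHost lines above are usually read: the first entry is treated as the primary controller and any later entries as backups. That would be consistent with the "standby mode" messages logged on server2. The helper name `controller_roles` and the inline sample are hypothetical, written only for illustration. This is not how Slurm itself parses the file, and it does not address the PMIx library error.

```python
def controller_roles(conf_text):
    """Return (hostname, role) pairs in the order SlurmctldHost lines appear."""
    hosts = []
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        if key.strip().lower() == "slurmctldhost":
            # The value may be "name" or "name(address)"; keep only the name.
            hosts.append(value.strip().split("(", 1)[0])
    # Assumption: first entry is the primary controller, the rest are backups.
    return [(h, "primary" if i == 0 else "backup") for i, h in enumerate(hosts)]


# Sample taken from the first lines of the slurm.conf in this question.
sample = """ClusterName=DlabCluster
SlurmctldHost=server1
SlurmctldHost=server2
"""

for host, role in controller_roles(sample):
    print(f"{host}: {role}")
```

Run against the configuration in this question, the sketch lists server1 as primary and server2 as backup, so the status output above was taken on the backup controller.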