I am trying to set up a 3-node cluster using Slurm, following this tutorial: https://github.com/SergioMEV/slurm-for-dummies
After successfully installing MUNGE and generating the slurm.conf file with the Slurm configurator tool (https://slurm.schedmd.com/configurator.html), I am stuck at this point:
OS: Proxmox VE 8.1.3 x86_64
Output of 'systemctl status slurmctld.service':
● slurmctld.service - Slurm controller daemon
     Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled; preset: enabled)
     Active: active (running) since Thu 2024-01-11 15:10:45 IST; 7min ago
       Docs: man:slurmctld(8)
   Main PID: 3484 (slurmctld)
      Tasks: 6
     Memory: 3.1M
        CPU: 57ms
     CGroup: /system.slice/slurmctld.service
             ├─3484 /usr/sbin/slurmctld -D -s
             └─3488 "slurmctld: slurmscriptd"
Jan 11 15:10:45 server2 systemd[1]: Started slurmctld.service - Slurm controller daemon.
Jan 11 15:10:46 server2 slurmctld[3484]: slurmctld: slurmctld version 22.05.8 started on cluster dlabcluster
Jan 11 15:10:46 server2 slurmctld[3484]: slurmctld: error: mpi/pmix_v4: init: (null) [0]: mpi_pmix.c:195: pmi/pmix: can not load PMIx library
Jan 11 15:10:48 server2 slurmctld[3484]: slurmctld: error: Invalid RPC received REQUEST_TRIGGER_PULL while in standby mode
Jan 11 15:10:48 server2 slurmctld[3484]: slurmctld: error: Invalid RPC received MESSAGE_NODE_REGISTRATION_STATUS while in standby mode
Contents of slurm.conf:
ClusterName=DlabCluster
SlurmctldHost=server1
SlurmctldHost=server2
ProctrackType=proctrack/linuxproc
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=root
StateSaveLocation=/var/spool/slurmctld
TaskPlugin=task/affinity,task/cgroup
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
SchedulerType=sched/backfill
SelectType=select/cons_tres
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurmd.log
NodeName=server[1-3] RealMemory=128636 Sockets=1 CoresPerSocket=64 ThreadsPerCore=2 State=UNKNOWN
PartitionName=mainPartition Nodes=ALL Default=YES MaxTime=INFINITE State=UP
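
For reference, the log above comes from server2, which is the second SlurmctldHost entry in this file. Slurm treats the first SlurmctldHost as the primary controller and any later ones as backups, which seems consistent with the "standby mode" messages. The following is only a minimal Python sketch to make that ordering explicit; the helper name controller_hosts and the inlined config excerpt are hypothetical and are not part of Slurm or the tutorial.

```python
# Hypothetical helper (not a Slurm tool): list SlurmctldHost entries in file order.
# Excerpt of the slurm.conf shown above, inlined so the sketch is self-contained.
SLURM_CONF = """\
ClusterName=DlabCluster
SlurmctldHost=server1
SlurmctldHost=server2
SlurmUser=root
"""


def controller_hosts(conf_text):
    """Return the SlurmctldHost hostnames in the order they appear."""
    hosts = []
    for raw in conf_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        key, sep, value = line.partition("=")
        # slurm.conf parameter names are case-insensitive.
        if sep and key.strip().lower() == "slurmctldhost":
            # A value may carry an address, e.g. "server1(10.0.0.1)"; keep the hostname.
            hosts.append(value.strip().split("(", 1)[0])
    return hosts


hosts = controller_hosts(SLURM_CONF)
for index, host in enumerate(hosts):
    # Assumption stated above: first entry is the primary, the rest are backups.
    role = "primary" if index == 0 else "backup"
    print(f"{host}: {role}")
```

With the file as posted, this reports server1 as the primary and server2 as a backup, so the status of slurmctld on server1 is probably the more relevant thing to check.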