Slurm
⦾ What to do if slurmd services do not start after executing the omnia.yml playbook?
Resolution: Run the following command to manually restart the slurmd service on the nodes:
systemctl restart slurmd
⦾ What to do when Slurm services do not start automatically after the cluster reboots?
Resolution:
Manually restart the Slurm services on the kube_control_plane by running the following commands:
systemctl restart slurmdbd
systemctl restart slurmctld
systemctl restart prometheus-slurm-exporter
On all the cluster nodes, run
systemctl status slurmd
to check the slurmd service. If it is not running, restart it manually with systemctl restart slurmd.
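The restart sequence above can be wrapped in a small helper so that the order stays the same each time and the run stops at the first failure. This is a minimal sketch, not part of Omnia: the function name restart_in_order and the SYSTEMCTL override variable are hypothetical, and the service names in the usage comment simply mirror the commands listed above.

```shell
# Hypothetical helper: restart the given services one at a time, in order,
# and stop at the first one that fails to restart.
# SYSTEMCTL can be overridden (for example with "echo") for a dry run.
restart_in_order() {
    local ctl="${SYSTEMCTL:-systemctl}"
    local svc
    for svc in "$@"; do
        if ! "$ctl" restart "$svc"; then
            echo "failed to restart $svc" >&2
            return 1
        fi
    done
}

# Usage on the control plane node:
#   restart_in_order slurmdbd slurmctld prometheus-slurm-exporter
# Dry run that only prints the arguments that would be passed to systemctl:
#   SYSTEMCTL=echo restart_in_order slurmdbd slurmctld
```

Stopping at the first failure is a design choice for this sketch: slurmctld is restarted only after slurmdbd has come back.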
⦾ What to do if a new Slurm node is not added to the sinfo output of the Slurm control node when restart_slurm_services in omnia_config.yml is set to false?
Resolution:
Run the following command on the Slurm control node:
systemctl restart slurmctld
Verify that the Slurm node was added, using:
sinfo
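The verification step can also be scripted instead of reading the sinfo table by eye. The sketch below assumes a hypothetical helper name, node_listed, and a hypothetical node name, node01; it reads node names from standard input so that it can be fed from sinfo.

```shell
# Hypothetical helper: succeed if the given node name appears, as a whole line,
# in a list of node names read from stdin (one name per line).
node_listed() {
    grep -qxF -- "$1"
}

# Usage on the Slurm control node
# (-N: one line per node, -h: no header, -o '%N': print only the node name):
#   sinfo -N -h -o '%N' | node_listed node01 && echo "node01 is known to slurmctld"
```

Matching the whole line (-x) avoids a false positive when one node name is a prefix of another.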
⦾ Why do Slurm services fail?
Potential Cause: The slurm.conf is not configured properly.
Resolution:
Run the following commands:
slurmdbd -Dvvv
slurmctld -Dvvv
Refer to the /var/lib/log/slurmctld.log file for more information.
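When the log file is long, it can help to look only at the lines Slurm marks as errors. The following is a rough sketch, assuming the usual Slurm log layout of a bracketed timestamp followed by a severity such as error: or fatal:. The helper name slurm_log_errors is hypothetical.

```shell
# Hypothetical helper: print only the "error:" and "fatal:" lines from
# Slurm log text read on stdin. Assumes lines start with "[timestamp] ".
slurm_log_errors() {
    # "|| true" keeps the exit status at 0 when no line matches
    grep -E '^\[[^]]+\] (error|fatal):' || true
}

# Usage:
#   slurm_log_errors < /var/lib/log/slurmctld.log | tail -n 20
```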
⦾ What causes the Ports are Unavailable error?
Potential Cause: Slurm database connection fails.
Resolution:
Run the following commands:
slurmdbd -Dvvv
slurmctld -Dvvv
Refer to the /var/lib/log/slurmctld.log file.
Check the output of netstat -antp | grep LISTEN for PIDs in the listening state.
If PIDs are in the listening state, kill the processes using that specific port.
Restart all Slurm services:
systemctl restart slurmctld on slurm_control_node
systemctl restart slurmdbd on slurm_control_node
systemctl restart slurmd on slurm_node
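To find which process holds a given port, the netstat output from the steps above can be filtered by port number. This is a minimal sketch: listening_pids is a hypothetical helper name, and the port in the usage comment assumes slurmctld's default of 6817, which should be checked against SlurmctldPort in slurm.conf.

```shell
# Hypothetical helper: read `netstat -antp` output on stdin and print the PIDs
# of processes that are in the LISTEN state on the given TCP port.
listening_pids() {
    awk -v port="$1" '
        $6 == "LISTEN" && $4 ~ (":" port "$") {
            split($7, owner, "/")                 # field 7 looks like "1234/slurmctld"
            if (owner[1] ~ /^[0-9]+$/) print owner[1]
        }' | sort -u
}

# Usage (run as root so that netstat can show the PID column):
#   netstat -antp | listening_pids 6817
```

The printed PIDs can then be reviewed and stopped with kill before the Slurm services are restarted.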
⦾ What to do if the slurmctld service fails during omnia.yml execution when slurm_installation_type is nfs_share?
Potential Cause: This issue may arise due to internal network issues.
Resolution: Re-run the playbook with the same configuration and verify the status of the slurmctld service on the Slurm control node.
⦾ Why does the TASK: Install packages for slurm fail with the following error message?
Potential Cause: This error can happen:
Due to intermittent connectivity issues with the EPEL8 repositories from which the Slurm packages are downloaded.
Due to Slurm packages not being downloaded successfully during local_repo.yml execution.
Resolution:
While installing Slurm, Omnia recommends that users proceed with the always or partial scenarios of repo_config in input/software_config.json.
If the user still wants to proceed with the never scenario, they must wait for the EPEL8 repositories to be reachable and then re-run the local_repo.yml playbook to download and install the Slurm packages.
If the user doesn’t want to wait, they can change repo_config in input/software_config.json to always or partial, execute oim_cleanup.yml, and then re-run local_repo.yml to download the Slurm packages. After the packages are downloaded successfully, users need to provision the cluster and run omnia.yml to install the Slurm packages on the cluster nodes.
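Before re-running the playbooks, it may be worth confirming which repo_config scenario is currently set. The sketch below is an illustration only: the helper names repo_config_value and check_repo_config are hypothetical, and the sed pattern assumes that repo_config is stored as a simple "repo_config": "value" string pair in the JSON file.

```shell
# Hypothetical helper: read software_config.json text on stdin and print
# the first repo_config value found.
repo_config_value() {
    sed -n 's/.*"repo_config"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Hypothetical helper: report whether the configured scenario is one of the
# recommended ones (always or partial) or the never scenario.
check_repo_config() {
    case "$(repo_config_value)" in
        always|partial) echo "ok" ;;
        never)          echo "never: wait for EPEL8 or switch to always/partial" ;;
        *)              echo "repo_config not found or unrecognized" ;;
    esac
}

# Usage:
#   check_repo_config < input/software_config.json
```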
⦾ Why does the job_based_user_access.yml playbook fail while configuring the Slurm PAM module in either configless or NFS mode?
Resolution: This is a known issue, and the Omnia team is actively working on a solution.
If you have any feedback about Omnia documentation, please reach out at omnia.readme@dell.com.