Slurm
======

⦾ **What to do if the slurmd services do not start after executing the** ``omnia.yml`` **playbook?**

**Resolution**: Run the following command to manually restart the slurmd service on each affected node: ::

    systemctl restart slurmd


⦾ **What to do if Slurm services do not start automatically after the cluster reboots?**

**Resolution**:

* Manually restart the Slurm services on the kube_control_plane by running the following commands: ::

    systemctl restart slurmdbd
    systemctl restart slurmctld
    systemctl restart prometheus-slurm-exporter

* Run ``systemctl restart slurmd`` to manually restart the slurmd service on all the cluster nodes.
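
As a rough sketch of the steps above, the shell function below restarts only the units that systemd does not report as active. The name ``restart_if_inactive`` is hypothetical and not part of Omnia; the sketch assumes a systemd-based host and root privileges: ::

    # Hypothetical helper (not shipped with Omnia): restart each listed
    # systemd unit only if it is not currently active.
    restart_if_inactive() {
        for unit in "$@"; do
            if systemctl is-active --quiet "$unit"; then
                echo "$unit: active"
            else
                echo "$unit: restarting"
                systemctl restart "$unit"
            fi
        done
    }

On the control node this could be called as ``restart_if_inactive slurmdbd slurmctld prometheus-slurm-exporter``, and on each cluster node as ``restart_if_inactive slurmd``.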


⦾ **What to do if a new Slurm node is not added to the sinfo output of the Slurm control node when** ``restart_slurm_services`` **in** ``omnia_config.yml`` **is set to** ``false`` **?**

**Resolution**:

* Run the following command on the Slurm control node: ::

    systemctl restart slurmctld

* Verify that the Slurm node was added by running: ::

    sinfo
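
To check for one specific node rather than reading the full listing, a minimal sketch along these lines may help. The function name ``node_in_sinfo`` and the node name ``node001`` are placeholders chosen for illustration; the ``sinfo`` options used are ``-N`` (one line per node), ``-h`` (no header), and ``-o "%N"`` (print node names only): ::

    # Hypothetical helper: succeed if the given node name appears in sinfo.
    node_in_sinfo() {
        # -x matches the whole line, -F treats the name as a fixed string.
        sinfo -N -h -o "%N" | grep -qxF -- "$1"
    }

For example, ``node_in_sinfo node001 && echo "node001 is registered"`` prints the message only when the node is listed.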


⦾ **Why do Slurm services fail?**

**Potential Cause**: The ``slurm.conf`` file is not configured properly.

**Resolution**:

1. Run the following commands to start the daemons in the foreground with verbose logging: ::

     slurmdbd -Dvvv
     slurmctld -Dvvv

2. Refer to the ``/var/lib/log/slurmctld.log`` file for more information.


⦾ **What causes the** ``Ports are Unavailable`` **error?**

**Potential Cause:** The Slurm database connection fails.

**Resolution:**

1. Run the following commands to start the daemons in the foreground with verbose logging: ::

     slurmdbd -Dvvv
     slurmctld -Dvvv

2. Refer to the ``/var/lib/log/slurmctld.log`` file.

3. Check the output of ``netstat -antp | grep LISTEN`` for processes (PIDs) that are listening on the ports Slurm needs.

4. If other processes are in the **Listening** state on those ports, kill the processes holding that specific port.

5. Restart all Slurm services: ::

    systemctl restart slurmctld   # on slurm_control_node
    systemctl restart slurmdbd    # on slurm_control_node
    systemctl restart slurmd      # on slurm_node
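
Steps 3 and 4 can be narrowed to the ports Slurm uses. The sketch below filters ``netstat -antp`` output for listeners on a given list of ports. The function name ``listeners_on_ports`` is hypothetical, and the ports in the usage example (6817 for slurmctld, 6818 for slurmd, 6819 for slurmdbd) are the Slurm defaults; this assumes the cluster's ``slurm.conf`` and ``slurmdbd.conf`` do not override them: ::

    # Hypothetical helper: read `netstat -antp` output on stdin and print
    # "<port> <PID/program>" for each LISTEN socket on one of the given ports.
    listeners_on_ports() {
        awk -v ports="$*" '
            BEGIN { n = split(ports, p, " "); for (i = 1; i <= n; i++) want[p[i]] = 1 }
            $6 == "LISTEN" {
                # Column 4 is the local address; the port is its last
                # colon-separated field (this also covers IPv6 addresses).
                m = split($4, a, ":")
                if (a[m] in want) print a[m], $7
            }'
    }

A typical call would be ``netstat -antp | listeners_on_ports 6817 6818 6819``. Any PID reported for a program other than the expected Slurm daemon is a candidate for step 4.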


⦾ **What to do if the slurmctld service fails during** ``omnia.yml`` **execution when** ``slurm_installation_type`` **is** ``nfs_share`` **?**

**Potential Cause**: This issue may arise due to internal network issues.

**Resolution**: Re-run the playbook with the same configuration and verify the status of the slurmctld service on the Slurm control node.

⦾ **Why does the** ``TASK: Install packages for slurm`` **fail with the following error message?**

.. image:: ../../../images/slurm_epel.png

**Potential Cause**: This error can occur for either of the following reasons:

    * Intermittent connectivity issues with the EPEL8 repositories from which the Slurm packages are downloaded.
    * The Slurm packages were not downloaded successfully during ``local_repo.yml`` execution.

**Resolution**:

    * When installing Slurm, Omnia recommends that users proceed with the ``always`` or ``partial`` scenarios of ``repo_config`` in ``input/software_config.json``.
    * If the user still wants to proceed with the ``never`` scenario, they must wait for the EPEL8 repositories to be reachable and then re-run the ``local_repo.yml`` playbook to download the Slurm packages.
    * If the user does not want to wait, they can change ``repo_config`` in ``input/software_config.json`` to ``always`` or ``partial``, execute ``oim_cleanup.yml``, and then re-run ``local_repo.yml`` to download the Slurm packages. After the packages are downloaded successfully, users need to provision the cluster and run ``omnia.yml`` to install the Slurm packages on the cluster nodes.

⦾ **Why does the** ``job_based_user_access.yml`` **playbook fail while configuring the** `Slurm PAM module <https://slurm.schedmd.com/pam_slurm_adopt.html>`_ **in either configless or NFS mode?**

**Resolution**: This is a known issue, and the Omnia team is actively working on a solution.