Scheduler

Input Parameters for the Cluster

These parameters are located in input/omnia_config.yml.

freeipa_required
  Values: true, false
  Boolean indicating whether FreeIPA is required or not.

realm_name
  Values: OMNIA.TEST
  Sets the intended realm name.

directory_manager_password
  Password authenticating admin-level access to the directory for system management tasks. It will be added to the instance of the directory server created for IPA. Required length: 8 characters. The password must not contain the characters -, \, ' or ".

kerberos_admin_password
  "admin" user password for the IPA server on RockyOS.

domain_name
  Values: omnia.test
  Sets the intended domain name.

ldap_required
  Values: false, true
  Boolean indicating whether the LDAP client is required or not.

ldap_server_ip
  LDAP server IP. Required if ldap_required is true. There should be an explicit LDAP server running on this IP.

ldap_connection_type
  Values: TLS
  For a TLS connection, provide a valid certificate path. For an SSL connection, ensure port 636 is open.

ldap_ca_cert_path
  Values: /etc/openldap/certs/omnialdap.pem
  This variable accepts the server certificate path. Make sure the certificate is present in the path provided. The certificate should have a .pem or .crt extension. This variable is mandatory if the connection type is TLS.

user_home_dir
  Values: /home
  This variable accepts the user home directory path for LDAP configuration. If an NFS mount is created for user home directories, make sure you provide the LDAP users' mounted home directory path.

ldap_bind_username
  Values: admin
  If the LDAP server is configured with a bind DN, provide the bind DN user here. If this value is not provided when bind is configured on the server, LDAP authentication fails. Omnia does not validate this input; ensure that it is valid.

ldap_bind_password
  If the LDAP server is configured with a bind DN, provide the bind DN password here. If this value is not provided when bind is configured on the server, LDAP authentication fails. Omnia does not validate this input; ensure that it is valid.

enable_secure_login_node
  Values: false, true
  Boolean deciding whether security features are enabled on the login node.
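
For reference, a minimal sketch of how these parameters might be filled in within input/omnia_config.yml is shown below. Every value is an illustrative placeholder, not a shipped default; see the length and character rules above before setting passwords.

# Illustrative values only; adjust to your environment.
freeipa_required: true
realm_name: "OMNIA.TEST"
directory_manager_password: ""        # see the length and character rules above
kerberos_admin_password: ""
domain_name: "omnia.test"
ldap_required: false
ldap_server_ip: ""
ldap_connection_type: "TLS"
ldap_ca_cert_path: "/etc/openldap/certs/omnialdap.pem"
user_home_dir: "/home"
ldap_bind_username: "admin"
ldap_bind_password: ""
enable_secure_login_node: false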

Before You Build Clusters

  • Verify that all inventory files are updated.

  • If the target cluster requires more than 10 Kubernetes nodes, use a Docker enterprise account to avoid Docker pull limits.

  • Verify that all nodes are assigned a group. Use the inventory as a reference.

    • The manager group should have exactly 1 manager node.

    • The compute group should have at least 1 node.

    • The login_node group is optional. If present, it should have exactly 1 node.

    • Users should also ensure that all repos are available on the target nodes running RHEL.

Note

The inventory file accepts both IPs and FQDNs as long as they can be resolved by DNS.

  • For Red Hat clusters, ensure that a Red Hat subscription is enabled on every target node.
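
Since Omnia checks for an active subscription during omnia.yml (see the notes further below), a quick manual verification on each RHEL target node can save a failed run. These are standard Red Hat commands, not Omnia commands:

sudo subscription-manager status
sudo dnf repolist enabled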

Features enabled by omnia.yml

  • Slurm: Once all the required parameters in omnia_config.yml are filled in, omnia.yml can be used to set up Slurm.

  • Login node (optionally, a secured login node)

  • Kubernetes: Once all the required parameters in omnia_config.yml are filled in, omnia.yml can be used to set up Kubernetes.

  • BeeGFS bolt-on installation

  • NFS bolt-on support

Building Clusters

  1. In the input/omnia_config.yml file, provide the required details.

Note

Without the login node, Slurm jobs can be scheduled only through the manager node.

  2. Create an inventory file in the omnia folder. Add the manager node IP address under the [manager] group, compute node IP addresses under the [compute] group, and the login node IP address under the [login_node] group. Check out the sample inventory for more information; a minimal sketch is also shown below.
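
A minimal inventory sketch using placeholder IP addresses (FQDNs work as well, as noted above):

[manager]
10.5.0.101

[compute]
10.5.0.102
10.5.0.103

[login_node]
10.5.0.104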

Note

  • Omnia checks whether a Red Hat subscription is enabled on Red Hat nodes as a prerequisite. If the manager node does not have a Red Hat subscription enabled, omnia.yml will fail. If a compute node does not have a Red Hat subscription enabled, omnia.yml will skip that node entirely.

  • Omnia creates a log file which is available at: /var/log/omnia.log.

  • If only Slurm is being installed on the cluster, docker credentials are not required.

  3. To run omnia.yml:

ansible-playbook omnia.yml -i inventory

Note

  • To visualize cluster (Slurm/Kubernetes) metrics on Grafana on the control plane during the run of omnia.yml, pass the parameters grafana_username and grafana_password (that is, ansible-playbook omnia.yml -i inventory -e grafana_username="" -e grafana_password=""). Note that Grafana is not installed by omnia.yml if it is not already available on the control plane.

  • Omnia does not recommend placing the same node in both the manager and login_node groups in the inventory.

Using Skip Tags

Using skip tags, the scheduler running on the cluster can be set to Slurm or Kubernetes while running the omnia.yml playbook. This choice can be made depending on the expected HPC/AI workloads.

  • To set Slurm as the scheduler, skip the Kubernetes roles: ansible-playbook omnia.yml -i inventory --skip-tags "kubernetes"

  • To set Kubernetes as the scheduler, skip the Slurm roles: ansible-playbook omnia.yml -i inventory --skip-tags "slurm"

Note

  • If you want to view or edit the omnia_config.yml file, run the following command:

    • ansible-vault view omnia_config.yml --vault-password-file .omnia_vault_key – To view the file.

    • ansible-vault edit omnia_config.yml --vault-password-file .omnia_vault_key – To edit the file.

  • It is suggested that you use the ansible-vault view or edit commands rather than the ansible-vault decrypt or encrypt commands. If you have used decrypt or encrypt, set the permissions of omnia_config.yml to 644.

Kubernetes Roles

As part of setting up Kubernetes roles, omnia.yml handles the following tasks on the manager and compute nodes:

  • Docker is installed.

  • Kubernetes is installed.

  • Helm package manager is installed.

  • All required services are started (such as kubelet).

  • Different operators are configured via Helm.

  • Prometheus is installed.
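
As a quick sanity check after omnia.yml completes, the following commands, run from the manager node, can confirm the Kubernetes deployment (these are standard kubectl/Helm commands offered as suggestions, not part of Omnia):

kubectl get nodes -o wide   # manager and compute nodes should report a Ready status
helm list -A                # charts and operators deployed through Helm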

Slurm Roles

As part of setting up Slurm roles, omnia.yml handles the following tasks on the manager and compute nodes:

  • Slurm is installed.

  • All required services are started (such as slurmd, slurmctld, slurmdbd).

  • Prometheus is installed to visualize slurm metrics.

  • Lua and Lmod are installed as slurm modules.

  • Slurm restd is set up.
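
Similarly, the Slurm setup can be spot-checked from the manager node with standard Slurm and systemd commands (suggestions, not part of Omnia):

sinfo                                  # partitions and node states
systemctl status slurmctld slurmdbd    # controller and accounting daemons on the manager node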

Login node

If a login node is available and mentioned in the inventory file, the following tasks are executed:

  • Slurmd is installed.

  • All required configurations are made to the slurm.conf file to enable a Slurm login node.

Hostname requirements
  • In the examples folder, a mapping_host_file.csv template is provided which can be used for DHCP configuration. The header in the template file must not be deleted before saving the file. It is recommended to provide this optional file as it allows IP assignments provided by Omnia to be persistent across control plane reboots.

  • The hostname should not contain the following characters: , (comma), . (period), or _ (underscore). However, periods are allowed in the domain name.

  • The hostname cannot start or end with a hyphen (-).

  • No upper case characters are allowed in the hostname.

  • The hostname cannot start with a number.

  • The hostname and the domain name (that is: hostname00000x.domain.xxx) cumulatively cannot exceed 64 characters. For example, if the node_name provided in input/provision_config.yml is ‘node’, and the domain_name provided is ‘omnia.test’, Omnia will set the hostname of a target compute node to ‘node00001.omnia.test’. Omnia appends 6 digits to the hostname to individually name each target node.
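
For illustration, the rules above can be approximated with a quick shell check. This is only a sketch using placeholder values, not something Omnia ships:

host=node00001
domain=omnia.test
# Lower case only, must start with a letter, must not start or end with a hyphen,
# and no commas, periods or underscores within the hostname itself.
[[ "$host" =~ ^[a-z][a-z0-9-]*[a-z0-9]$ ]] || echo "invalid hostname"
# The hostname plus the domain name (hostname.domain) must not exceed 64 characters.
(( ${#host} + ${#domain} + 1 <= 64 )) || echo "FQDN exceeds 64 characters"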

Note

  • To enable the login node, ensure that login_node_required in input/omnia_config.yml is set to true.

Slurm job based user access

To ensure security while running jobs on the cluster, users can be assigned permissions to access compute nodes only while their jobs are running. To enable the feature:

cd scheduler
ansible-playbook job_based_user_access.yml -i inventory

Note

  • The inventory queried in the above command should be the same inventory created by the user prior to running omnia.yml, since scheduler.yml is invoked by omnia.yml.

  • Only users added to the ‘slurm’ group can execute slurm jobs. To add users to the group, use the command: usermod -a -G slurm <username>.

Running Slurm MPI jobs on clusters

To enhance the productivity of the cluster, Slurm allows users to run jobs in a parallel-computing architecture. This is used to efficiently utilize all available computing resources.

Note

  • Omnia does not install MPI packages by default. Users hoping to leverage the Slurm-based MPI execution feature are required to install the relevant packages from a source of their choosing.

  • Running jobs as individual users (and not as root) requires that passwordless SSH be enabled between compute nodes for the user.

For Intel

To run an MPI job on an Intel processor, set the following environment variables on the head nodes or within the job script:

  • I_MPI_PMI_LIBRARY = /usr/lib64/pmix/

  • FI_PROVIDER = sockets (When InfiniBand network is not available, this variable needs to be set)

  • LD_LIBRARY_PATH (Use this variable to point to the location of the Intel/Python library folder. For example: $LD_LIBRARY_PATH:/mnt/jobs/intelpython/python3.9/envs/2022.2.1/lib/)
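
For illustration, a minimal sbatch script that applies these settings might look like the sketch below. The library path is the example quoted above, and mpi_hello is a placeholder for your own MPI binary; adjust both to your cluster.

#!/bin/bash
#SBATCH --job-name=intel-mpi-test
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4

# Use Slurm's PMIx library; fall back to the sockets provider when InfiniBand is absent
export I_MPI_PMI_LIBRARY=/usr/lib64/pmix/
export FI_PROVIDER=sockets
# Example library path from above; adjust to your Intel/Python installation
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/mnt/jobs/intelpython/python3.9/envs/2022.2.1/lib/

srun ./mpi_hello   # placeholder for your MPI binary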

For AMD

To run an MPI job on an AMD processor, set the following environment variables on the head nodes or within the job script:

  • PATH (Use this variable to point to the location of the OpenMPI binary folder. For example: PATH=$PATH:/appshare/openmpi/bin)

  • LD_LIBRARY_PATH (Use this variable to point to the location of the OpenMPI library folder. For example: $LD_LIBRARY_PATH:/appshare/openmpi/lib)

  • OMPI_ALLOW_RUN_AS_ROOT = 1 (To run jobs as a root user, set this variable to 1)

  • OMPI_ALLOW_RUN_AS_ROOT_CONFIRM = 1 (To run jobs as a root user, set this variable to 1)
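
For illustration, a minimal sbatch script that applies these settings might look like the sketch below. The OpenMPI paths are the examples quoted above, and mpi_hello is a placeholder for your own MPI binary; adjust both to your cluster.

#!/bin/bash
#SBATCH --job-name=openmpi-test
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4

# Example OpenMPI install location from above; adjust to your cluster
export PATH=$PATH:/appshare/openmpi/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/appshare/openmpi/lib
# Only needed if the job runs as root
export OMPI_ALLOW_RUN_AS_ROOT=1
export OMPI_ALLOW_RUN_AS_ROOT_CONFIRM=1

mpirun ./mpi_hello   # placeholder for your MPI binary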