Releases

1.7.1

  • Enablement of AMD 17G servers - R6725, R7725, R6715, R7715

  • Enablement of Intel Gaudi 3 accelerator

  • Enablement of NVIDIA accelerators - L40s, H100 NVL, H200 SXM

  • Support for Ubuntu 24.04 OS

  • Support for upgrading Omnia version on the OIM, from 1.7 to 1.7.1

  • Support for NVIDIA GPU operator (25.3.0) on nodes running Ubuntu 24.04 OS

  • Support for adding external nodes (with pre-loaded OS and internet connectivity) to a Kubernetes cluster

  • Support for configuring additional NICs and updating kernel parameters during the provisioning of the cluster nodes

  • Support for NVIDIA Collective Communications Library (NCCL) 2.25.1 on nodes with NVIDIA accelerators running Ubuntu 24.04 OS

  • Support for ROCm Communication Collectives Library (RCCL) 2.21.5 on nodes with AMD accelerators

  • Support for Multus-CNI plugin (4.1.4) and Whereabouts plugin (0.8.0) for Kubernetes (K8s)

  • Support for RoCE configuration with Calico network plugin

  • Updated software packages for Omnia 1.7.1:

    • Intel Gaudi driver - 1.19.2

    • Kubernetes - 1.31.4

    • Kubespray - 2.27

    • CSI PowerScale driver - 2.13.0

    • NVIDIA CUDA - 12.8

    • NVIDIA vLLM - 0.7.2

    • AMD ROCm - 6.3.1

    • Grafana - 11.4.1

    • BCM RoCE - 232.1.133.2 with the following additional packages:

      • niccli_232.0.153.0-1_x86_64.deb

      • bnxt_re_conf_232.0.155.5-1_all.deb

1.7

  • Omnia now executes exclusively within a virtual environment created by the prereq.sh script

  • Python version upgraded to 3.11 (previously 3.9)

  • Ansible version upgraded to 9.5.1 (previously 7.7.0)

  • Kubernetes version upgraded to 1.29.5 (previously 1.26.12)

  • Pre-enablement for Intel Gaudi 3 accelerators:

    • Software stack installation (See the support matrix for the supported Intel firmware version)

    • Accelerator status verification using HCCL and hl_qual

    • Inventory tagging for the Gaudi accelerators (compute_gpu_intel)

    • Monitoring for the Gaudi accelerators via:

      • Omnia telemetry

      • iDRAC telemetry

      • Kubernetes telemetry via Prometheus exporter

    • Visualization of the Kubernetes telemetry and Intel Gaudi accelerator metrics using Grafana

    • AI tools enablement:

      • DeepSpeed

      • Kubeflow

      • vLLM

  • Sample playbook for a pre-trained Generative AI model - Llama 3.1

  • CSI drivers for Kubernetes to access PowerScale storage with an option to enable the SmartConnect feature (without SSL certificates)

  • Added support for NVIDIA container toolkit for NVIDIA accelerators in a Kubernetes cluster

  • Added support for corporate proxy on RHEL, Rocky Linux, and Ubuntu clusters

  • Set OS kernel command-line parameters and/or configure additional NICs on the nodes using a single playbook

  • The internal OpenLDAP server can now be configured as a proxy server

1.6.1

Omnia v1.6.1 addresses an issue caused by the unavailability of the dependent package ‘libssl1.1_1.1.1f-1ubuntu2.22_amd64’, which Omnia v1.6 requires on the Ubuntu 22.04 operating system. This release resolves that issue and ensures that Omnia functions properly on Ubuntu 22.04.

1.6

1.5.1

  • Omnia now installs Kubernetes 1.26.

1.5

  • Extensive telemetry and monitoring have been added to the Omnia stack, intended for customers who use Dell systems and Omnia to provide SaaS/IaaS solutions. These include, but are not limited to:

    • CPU utilization and status

    • GPU utilization

    • Node count

    • Network packet I/O

    • HDD capacity and free space

    • Memory capacity and utilization

    • Queued and running job count

    • User count

    • Cluster HW health checks (PCIe, NVLink, BMC, temperatures)

    • Cluster SW health checks (dmesg, BeeGFS, k8s nodes/pods, MySQL on control plane)

  • Metrics are extracted using a combination of the following tools: psutil, smartctl, beegfs-ctl, nvidia-smi, and rocm-smi. Because the groundwork is already laid, requests for additional metrics from these tools will be quicker to implement in the future.

  • Telemetry and health checks can be optionally disabled.

  • Log Aggregation via xCAT syslog:

    • Aggregated on the control plane; the default grouping is “severity”, with others available.

    • Uses Grafana Loki for viewing.

  • The Omnia GitHub repository now hosts a “genesis” image with this functionality baked in for initial bootup.

  • Host aliasing for Scheduler and IPA authentication.

  • Login and kube_control_plane access from both public and private NIC.

  • Validation check enhancements:

    • Rearranged to occur as early as possible.

    • Isolate checks when running smaller playbooks.

1.4.3

  • XE9640, R760xa, R760xd2 are now supported as control planes or target nodes with NVIDIA H100 accelerators.

  • Added ability for split port configuration on NVIDIA Quantum-2-based QM9700 (NVIDIA InfiniBand NDR400 switches).

  • Extended password-less SSH support for multiple user configuration in a single execution.

  • Input mapping files and inventory files now support commented entries for customized playbook execution.

  • NFS share is now available for hosting user home directories within the cluster.

1.4.2

  • XE9680, R760, R7625, R6615, R7615 are now supported as control planes or target nodes.

  • Added ability for switch-based discovery of remote servers and PXE provisioning.

  • An active Red Hat subscription is no longer required on the control plane and the cluster nodes. Users can configure and use local RHEL repositories.

  • IP ranges can be defined for assignment to remote nodes when discovered via the switch.

1.4.1

  • R660, R6625 and C6620 platforms are now supported as control planes or target nodes.

  • One-touch provisioning now allows for OFED installation and NVIDIA CUDA toolkit installation, along with iDRAC and InfiniBand IP configuration, on target nodes.

  • Potential servers can now be discovered via iDRAC.

  • Servers can be provisioned automatically without manual intervention for booting/PXE settings.

  • Target node provisioning status can now be checked on the control plane by viewing the OmniaDB.

  • Omnia clusters can be configured with password-less SSH for seamless execution of HPC jobs run by non-root users.

  • Accelerator drivers can be installed on Rocky Linux target nodes in addition to RHEL.

1.4

  • Provisioning of remote nodes through PXE boot by providing TOR switch IP

  • Provisioning of remote nodes through PXE boot by providing mapping file

  • PXE provisioning of remote nodes through admin NIC or shared LOM NIC

  • Database update of MAC address, hostname, and admin IP

  • Optional monitoring support (Grafana installation) on control plane

  • OFED installation on the remote nodes

  • CUDA installation on the remote nodes

  • AMD accelerator and ROCm support on the remote nodes

  • Omnia playbook execution with Kubernetes, Slurm, and FreeIPA installation in all cluster nodes

  • InfiniBand switch configuration and split port functionality

  • Added support for Ethernet Z series switches.

1.3

  • CLI support for all Omnia playbooks (AWX GUI is now optional/deprecated).

  • Automated discovery and configuration of all devices (including PowerVault, InfiniBand, and Ethernet switches) in shared LOM configuration.

  • Job-based user access with Slurm.

  • AMD server support (R6415, R7415, R7425, R6515, R6525, R7515, R7525, C6525).

  • PowerVault ME5 series support (ME5012, ME5024, ME5084).

  • PowerVault ME4 and ME5 SAS Controller configuration, and NFS server and client configuration.

  • NFS bolt-on support.

  • BeeGFS bolt-on support.

  • Lua and Lmod installation on manager and compute nodes running Red Hat 8.x, Rocky Linux 8.x, and Leap 15.3.

  • Automated setup of FreeIPA client on all nodes.

  • Automated configuration of PXE device settings (active NIC) on iDRAC.

1.2.2

  • Bugfix patch release to address AWX Inventory not being updated.

1.2.1

  • HPC cluster formation using shared LOM network

  • Support for PXE boot on the shared LOM network as well as on a high-speed Ethernet or InfiniBand path

  • Support for BOSS Control Card

  • Support for RHEL 8.x with ability to activate the subscription

  • Ability to upgrade the kernel on RHEL

  • Bolt-on Support for BeeGFS

1.2.0.1

  • Bugfix patch release that addresses the broken Cobbler container issue.

  • Rocky Linux 8.6 Support

1.2

  • Omnia supports Rocky Linux 8.5 full OS on the Control Plane

  • Omnia supports Ansible version 2.12 (ansible-core) with Python 3.6 support

  • All packages required to enable the HPC/AI cluster are deployed as a pod on control plane

  • Omnia now installs Grafana as a single pane of glass to view logs, metrics and telemetry visualization

  • Cluster node provisioning can be done via PXE and iDRAC

  • Omnia supports multiple operating systems on the cluster, including Rocky Linux 8.5 and openSUSE Leap 15.3

  • Omnia can deploy cluster nodes with a single NIC.

  • All cluster metrics can be viewed using Grafana on the control plane (as opposed to checking the kube_control_plane on each cluster)

  • AWX node inventory now displays service tags with the relevant operating system.

  • Omnia adheres to most of the requirements of NIST 800-53 and NIST 800-171 guidelines on the control plane and login node.

  • Omnia has extended the FreeIPA feature to provide authentication and authorization on Rocky Linux nodes.

  • Omnia uses 389ds (https://directory.fedoraproject.org/) to provide authentication and authorization on Leap nodes.

  • Email alerts have been added in case of login failures.

  • Administrators can restrict users or hosts from accessing the control plane and login node over SSH.

  • Malicious or unwanted network software access can be restricted by the administrator.

  • Administrators can restrict the idle time allowed in an SSH session.

  • Omnia installs AppArmor to restrict program access on Leap nodes.

  • Security on audit log access is provided.

  • Program execution on the control plane and login node is logged using the Snoopy tool.

  • User activity on the control plane and login node is monitored using psacct/acct tools installed by Omnia

  • Omnia fetches key performance indicators from iDRACs present in the cluster

  • Omnia also supports fetching performance indicators on the nodes in the cluster when Slurm jobs are running.

  • The telemetry data is plotted on Grafana to provide better visualization capabilities.

  • Four visualization plugins are supported for presenting and analyzing iDRAC and Slurm data:

    • Parallel Coordinate

    • Spiral

    • Sankey

    • Stream-net (also known as Power Map)

  • In addition to the above features, changes have been made to enhance the performance of Omnia.

If you have any feedback about Omnia documentation, please reach out at omnia.readme@dell.com.