Metrics collected

Regular metrics

Your cluster in numbers: Regular metrics include information such as CPU, memory, packet errors, and drives.

| Metric Name | Unit | Possible values | Possible error causes |
| --- | --- | --- | --- |
| BlockedProcesses | processes | Metric Value; No Data | The /proc/stat file is inaccessible. |
| CPUSystem | seconds | Metric Value; No Data | The psutil library encountered an error. |
| CPUWait | seconds | Metric Value; No Data | The psutil library encountered an error. |
| ErrorsRecv | | Metric Value; No Data | The psutil library encountered an error. |
| ErrorsSent | | Metric Value; No Data | The psutil library encountered an error. |
| FailedJobs | | Metric Value; No Data | Slurm is not installed. |
| HardwareCorruptedMemory | kB | Metric Value; No Data | The /proc/meminfo file is inaccessible. |
| MemoryActive | bytes | Metric Value; No Data | The psutil library encountered an error. |
| MemoryAvailable | bytes | Metric Value; No Data | The psutil library encountered an error. |
| MemoryCached | bytes | Metric Value; No Data | The psutil library encountered an error. |
| MemoryFree | bytes | Metric Value; No Data | The psutil library encountered an error. |
| MemoryInactive | bytes | Metric Value; No Data | The psutil library encountered an error. |
| MemoryPercent | percent | Metric Value; No Data | The psutil library encountered an error. |
| MemoryShared | bytes | Metric Value; No Data | The psutil library encountered an error. |
| MemoryTotal | bytes | Metric Value; No Data | The psutil library encountered an error. |
| MemoryUsed | bytes | Metric Value; No Data | The psutil library encountered an error. |
| NodesDown | | Metric Value; No Data | Slurm is not installed. |
| NodesTotal | | Metric Value; No Data | Slurm is not installed. |
| NodesUp | | Metric Value; No Data | Slurm is not installed. |
| QueuedJobs | | Metric Value; No Data | Slurm is not installed. |
| RunningJobs | | Metric Value; No Data | Slurm is not installed. |
| SMARTHDATemp | °C | Metric Value; No Data | smartctl commands failed. |
| UniqueUserLogin | | Metric Value; No Data | |
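
The "No Data" cases above can be illustrated with a minimal sketch. It is not Omnia's collector: the function names, the use of `None` for "No Data", and the mapping from metric names to psutil attributes are all assumptions made for illustration. It assumes a Linux node, where `/proc/stat` exposes `procs_blocked` and `/proc/meminfo` exposes `HardwareCorrupted` (in kB).

```python
from pathlib import Path
from typing import Dict, Optional

NO_DATA = None  # assumption: None stands in for the table's "No Data"

# Assumed mapping from metric names to psutil.virtual_memory() attributes.
PSUTIL_MEMORY_FIELDS = {
    "MemoryActive": "active",
    "MemoryAvailable": "available",
    "MemoryCached": "cached",
    "MemoryFree": "free",
    "MemoryInactive": "inactive",
    "MemoryPercent": "percent",
    "MemoryShared": "shared",
    "MemoryTotal": "total",
    "MemoryUsed": "used",
}


def parse_blocked_processes(stat_text: str) -> Optional[int]:
    """Return the procs_blocked count from /proc/stat content, if present."""
    for line in stat_text.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0] == "procs_blocked":
            return int(fields[1])
    return NO_DATA


def parse_hardware_corrupted_kb(meminfo_text: str) -> Optional[int]:
    """Return the HardwareCorrupted value in kB from /proc/meminfo content."""
    for line in meminfo_text.splitlines():
        if line.startswith("HardwareCorrupted:"):
            # Expected shape: "HardwareCorrupted:     0 kB"
            return int(line.split()[1])
    return NO_DATA


def read_text_or_none(path: str) -> Optional[str]:
    """Read a file, returning None when it is missing or unreadable."""
    try:
        return Path(path).read_text()
    except OSError:
        return None


def collect_proc_metrics() -> Dict[str, Optional[int]]:
    """Collect the two /proc-based metrics, reporting No Data on failure."""
    stat_text = read_text_or_none("/proc/stat")
    meminfo_text = read_text_or_none("/proc/meminfo")
    return {
        "BlockedProcesses": (
            parse_blocked_processes(stat_text) if stat_text is not None else NO_DATA
        ),
        "HardwareCorruptedMemory": (
            parse_hardware_corrupted_kb(meminfo_text)
            if meminfo_text is not None
            else NO_DATA
        ),
    }


def collect_psutil_memory_metrics() -> Dict[str, Optional[float]]:
    """Collect memory metrics through psutil, reporting No Data on any error."""
    try:
        import psutil  # third-party; may be missing on the node

        memory = psutil.virtual_memory()
    except Exception:
        return {name: NO_DATA for name in PSUTIL_MEMORY_FIELDS}
    # Some attributes exist only on certain platforms, hence the default.
    return {
        name: getattr(memory, attr, NO_DATA)
        for name, attr in PSUTIL_MEMORY_FIELDS.items()
    }
```

Keeping the parsers separate from the file reads makes the "file is inaccessible" path easy to exercise: a missing file and a missing field both end up as "No Data" rather than raising an exception.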

Health metrics

The health of your cluster: Health metrics include key performance indicators.

| Metric Name | Possible values | Possible failure causes |
| --- | --- | --- |
| dmesg | Unknown; Fail; Pass | [Unknown] The dmesg command was not found on the cluster node. [Fail] The dmesg command returned an error log message. |
| beegfs -beegfsstat | Unknown; Fail; Pass | [Unknown] BeeGFS is not installed or is inactive. [Fail] The BeeGFS client service has failed, or the node is not in the list of reachable BeeGFS clients. |
| gpu_driver_health:gpu | Unknown; Fail; Pass | AMD/NVIDIA accelerators are not present. GPU drivers, including ROCm and CUDA, are not installed. |
| gpu_health_nvlink:gpu [1] | Unknown; Fail; Pass | AMD/NVIDIA accelerators are not present. NVLinks are not present. GPU drivers, including ROCm and CUDA, are not installed. |
| gpu_health_pcie:gpu | Unknown; Fail; Pass | AMD/NVIDIA accelerators are not present. GPU drivers, including ROCm and CUDA, are not installed. |
| gpu_health_pmu:gpu | Unknown; Fail; Pass | AMD/NVIDIA accelerators are not present. GPU drivers, including ROCm and CUDA, are not installed. |
| gpu_health_power:gpu | Unknown; Fail; Pass | AMD/NVIDIA accelerators are not present. GPU drivers, including ROCm and CUDA, are not installed. |
| gpu_health_thermal:gpu | Unknown; Metric Value | AMD/NVIDIA accelerators are not present. GPU drivers, including ROCm and CUDA, are not installed. |
| Kubernetespodsstatus | Unknown; Fail; Pass | Kubernetes is not installed. |
| Kuberneteschildnode | Unknown; Fail; Pass | Kubernetes is not installed. |
| kubernetesnodesstatus | Unknown; Fail; Pass | Kubernetes is not installed. |
| kubernetescomponentsstatus | Unknown; Fail; Pass | Kubernetes is not installed. |
| Smart | Unknown; Fail; Pass | smartctl commands failed. |
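
As an illustration of the Unknown/Fail/Pass pattern, the dmesg row above could be approximated as follows. This is a sketch under stated assumptions, not Omnia's implementation: the function names are hypothetical, "an error log message" is taken to mean any output from `dmesg --level=err` (a util-linux option), and a failed or restricted invocation is treated as Unknown.

```python
import shutil
import subprocess
from typing import List, Optional


def read_dmesg_errors() -> Optional[List[str]]:
    """Return error-level kernel log lines, or None if dmesg cannot be used."""
    if shutil.which("dmesg") is None:
        return None  # the dmesg command was not found on this node
    try:
        result = subprocess.run(
            ["dmesg", "--level=err"],
            capture_output=True,
            text=True,
            timeout=10,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    if result.returncode != 0:
        # Assumption: e.g. restricted kernel log access is reported as Unknown.
        return None
    return [line for line in result.stdout.splitlines() if line.strip()]


def classify_dmesg(error_lines: Optional[List[str]]) -> str:
    """Map the dmesg result onto the three health values."""
    if error_lines is None:
        return "Unknown"
    return "Fail" if error_lines else "Pass"
```

A caller would combine the two, for example `classify_dmesg(read_dmesg_errors())`. Separating the command invocation from the classification keeps the three-state logic testable without access to the kernel log.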

GPU metrics

The GPUs of your cluster: GPU metrics include information about the GPUs in the cluster.

| Metric Name | Unit | Possible values | Possible error causes |
| --- | --- | --- | --- |
| gpu_temperature:gpu | °C | Metric value; No data | AMD/NVIDIA accelerators are not present. GPU drivers, including ROCm and CUDA, are not installed. |
| gpu_utilization | percent | Metric value; No data | AMD/NVIDIA accelerators are not present. GPU drivers, including ROCm and CUDA, are not installed. |
| gpu_utilization:average | percent | Metric value; No data | AMD/NVIDIA accelerators are not present. GPU drivers, including ROCm and CUDA, are not installed. |
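
The relationship between `gpu_utilization` and `gpu_utilization:average` is not spelled out here. One plausible reading, shown purely as an assumption, is an arithmetic mean over the per-GPU utilization readings on a node, with "No data" when no GPU reported a value. The function name below is hypothetical.

```python
from statistics import mean
from typing import Optional, Sequence


def average_gpu_utilization(
    per_gpu_percent: Sequence[Optional[float]],
) -> Optional[float]:
    """Average per-GPU utilization, ignoring GPUs that reported no data."""
    readings = [value for value in per_gpu_percent if value is not None]
    if not readings:
        return None  # no accelerators or drivers: report "No data"
    return mean(readings)
```

For example, `average_gpu_utilization([10.0, 30.0])` returns `20.0`, and an empty list returns `None`.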

If you have any feedback about Omnia documentation, please reach out at omnia.readme@dell.com.