Troubleshooting guide

Troubleshooting Kubeadm

For a complete guide to troubleshooting kubeadm, see the "Troubleshooting kubeadm" page in the official Kubernetes documentation.

Connecting to internal databases

  • TimescaleDB
    • Start a bash session within the timescaledb pod: kubectl exec -it pod/timescaledb-0 -n telemetry-and-visualizations -- /bin/bash

    • Connect to psql using the psql -U <postgres_username> command (psql takes the username with an uppercase -U).

    • Connect to the database using the \c telemetry_metrics command.

  • MySQL DB
    • Start a bash session within the mysqldb pod using the kubectl exec -it pod/mysqldb-0 -n telemetry-and-visualizations -- /bin/bash command.

    • Connect to MySQL using the mysql -u <mysqldb_username> -p command and provide the password when prompted.

    • Select the database using the USE idrac_telemetrysource_services_db command.
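
The steps for each database can also be collapsed into a single command. The sketch below is a hypothetical helper (db_connect_cmd is not part of Omnia) that only prints the command for review rather than running it. It reuses the pod, namespace, and database names listed above and assumes the psql and mysql clients are available inside the respective pods.

```shell
# Hypothetical helper: print (do not run) a one-line command that opens a
# database client inside the pod, using the names from the steps above.
db_connect_cmd() {
    # $1 = timescaledb | mysqldb, $2 = database username
    ns="telemetry-and-visualizations"
    case "$1" in
        timescaledb)
            # psql takes the username with uppercase -U and the database with -d.
            echo "kubectl exec -it pod/timescaledb-0 -n $ns -- psql -U $2 -d telemetry_metrics"
            ;;
        mysqldb)
            # -p makes mysql prompt for the password; the last word is the database.
            echo "kubectl exec -it pod/mysqldb-0 -n $ns -- mysql -u $2 -p idrac_telemetrysource_services_db"
            ;;
        *)
            echo "unknown database: $1" >&2
            return 1
            ;;
    esac
}

db_connect_cmd timescaledb "<postgres_username>"
db_connect_cmd mysqldb "<mysqldb_username>"
```

Replace the placeholder usernames before running the printed commands.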

Checking and updating encrypted parameters

  1. Change to the directory where the parameters are saved (this example uses provision_config_credentials.yml):

    cd input/
    
  2. To view the encrypted parameters:

    ansible-vault view provision_config_credentials.yml --vault-password-file .provision_credential_vault_key
    
  3. To edit the encrypted parameters:

    ansible-vault edit provision_config_credentials.yml --vault-password-file .provision_credential_vault_key
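
Before running ansible-vault view or edit, it can help to confirm that the file is actually encrypted, since both commands fail on plain-text input. The following is a minimal sketch, not an Omnia utility: is_vault_encrypted is a hypothetical function name, and the check relies only on the header line that Ansible Vault writes at the top of encrypted files.

```shell
# Return 0 if the file starts with the Ansible Vault header, 1 otherwise
# (including when the file does not exist).
is_vault_encrypted() {
    # Vault-encrypted files begin with a line such as "$ANSIBLE_VAULT;1.1;AES256".
    head -n 1 "$1" 2>/dev/null | grep -q '^\$ANSIBLE_VAULT;'
}

# Example use, assuming the current directory contains input/ as in step 1.
if is_vault_encrypted input/provision_config_credentials.yml; then
    echo "encrypted: use ansible-vault view or edit"
else
    echo "not encrypted, or file not found"
fi
```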
    

Checking pod status from the OIM

  • Use this command to get a list of all available pods: kubectl get pods -A

  • Check the status of any specific pod by running: kubectl describe pod <pod name> -n <namespace name>
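
On a busy cluster the full pod list is long, so it can be useful to narrow it to pods that need attention. The sketch below is one possible filter (unhealthy_pods is a hypothetical name): it reads the output of kubectl get pods -A --no-headers and prints a describe command for each pod whose STATUS column is neither Running nor Completed.

```shell
# Read `kubectl get pods -A --no-headers` output on stdin.
# Columns: NAMESPACE NAME READY STATUS RESTARTS AGE.
# Print a ready-to-run describe command for every pod whose status is
# neither Running nor Completed.
unhealthy_pods() {
    awk 'NF >= 4 && $4 != "Running" && $4 != "Completed" {
        printf "kubectl describe pod %s -n %s\n", $2, $1
    }'
}

# Typical use on the OIM (requires cluster access):
#   kubectl get pods -A --no-headers | unhealthy_pods
```

A pod in the Running state can still have containers that are not ready, so the READY column is worth checking as well.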

Using telemetry information to diagnose node issues

Regular telemetry metrics

| Metric Name | Unit | Possible Values | Possible error causes |
|---|---|---|---|
| BlockedProcesses | processes | Metric Value; No Data | This could happen if the /proc/stat file is inaccessible. |
| CPUSystem | seconds | Metric Value; No Data | This could happen when the psutil library encounters errors. |
| CPUWait | seconds | Metric Value; No Data | This could happen when the psutil library encounters errors. |
| ErrorsRecv | | Metric Value; No Data | This could happen when the psutil library encounters errors. |
| ErrorsSent | | Metric Value; No Data | This could happen when the psutil library encounters errors. |
| FailedJobs | | Metric Value; No Data | Slurm is not installed. |
| HardwareCorruptedMemory | kB | Metric Value; No Data | This could happen if the /proc/meminfo file is inaccessible. |
| MemoryActive | bytes | Metric Value; No Data | This could happen when the psutil library encounters errors. |
| MemoryAvailable | bytes | Metric Value; No Data | This could happen when the psutil library encounters errors. |
| MemoryCached | bytes | Metric Value; No Data | This could happen when the psutil library encounters errors. |
| MemoryFree | bytes | Metric Value; No Data | This could happen when the psutil library encounters errors. |
| MemoryInactive | bytes | Metric Value; No Data | This could happen when the psutil library encounters errors. |
| MemoryPercent | percent | Metric Value; No Data | This could happen when the psutil library encounters errors. |
| MemoryShared | bytes | Metric Value; No Data | This could happen when the psutil library encounters errors. |
| MemoryTotal | bytes | Metric Value; No Data | This could happen when the psutil library encounters errors. |
| MemoryUsed | bytes | Metric Value; No Data | This could happen when the psutil library encounters errors. |
| NodesDown | | Metric Value; No Data | Slurm is not installed. |
| NodesTotal | | Metric Value; No Data | Slurm is not installed. |
| NodesUp | | Metric Value; No Data | Slurm is not installed. |
| QueuedJobs | | Metric Value; No Data | Slurm is not installed. |
| RunningJobs | | Metric Value; No Data | Slurm is not installed. |
| SMARTHDATemp | C | Metric Value; No Data | smartctl commands failed. |
| UniqueUserLogin | | Metric Value; No Data | |
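
When a regular metric reports No Data, the causes listed for it can be checked directly on the affected node. The following is a rough sketch under stated assumptions: check_readable is a hypothetical helper, and python3 is assumed to be the interpreter in which psutil is installed, which may differ on a given node.

```shell
# Report whether a file that a metric depends on can be read.
check_readable() {
    if [ -r "$1" ]; then
        echo "ok: $1"
    else
        echo "inaccessible: $1"
    fi
}

# Files named as error causes for the regular metrics.
check_readable /proc/stat      # BlockedProcesses
check_readable /proc/meminfo   # HardwareCorruptedMemory

# psutil is named as the cause for the CPU, network error, and memory
# metrics; python3 is an assumed interpreter name.
if python3 -c 'import psutil' 2>/dev/null; then
    echo "ok: psutil"
else
    echo "missing or broken: psutil"
fi
```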

Health telemetry metrics

| Metric Name | Possible value(s) | Possible failure causes |
|---|---|---|
| dmesg | Unknown; Fail; Pass | [Unknown] The dmesg command was not found on the cluster node. [Fail] The dmesg command returned an error log message. |
| beegfs -beegfsstat | Unknown; Fail; Pass | [Unknown] BeeGFS is not installed or inactive. [Fail] The BeeGFS client service has failed or the node is not present in reachable lists of BeeGFS clients. |
| gpu_driver_health:gpu | Unknown; Fail; Pass | AMD/NVIDIA/Intel accelerators are not present. GPU drivers are not installed. |
| gpu_health_nvlink:gpu [1] | Unknown; Fail; Pass | AMD/NVIDIA/Intel accelerators are not present. NVLinks are not present. GPU drivers are not installed. |
| gpu_health_pcie:gpu | Unknown; Fail; Pass | AMD/NVIDIA/Intel accelerators are not present. GPU drivers are not installed. |
| gpu_health_pmu:gpu | Unknown; Fail; Pass | AMD/NVIDIA/Intel accelerators are not present. GPU drivers are not installed. |
| gpu_health_power:gpu | Unknown; Fail; Pass | AMD/NVIDIA/Intel accelerators are not present. GPU drivers are not installed. |
| gpu_health_thermal:gpu | Unknown; Metric Value | AMD/NVIDIA/Intel accelerators are not present. GPU drivers are not installed. |
| Kubernetespodsstatus | Unknown; Fail; Pass | Kubernetes is not installed. |
| Kuberneteschildnode | Unknown; Fail; Pass | Kubernetes is not installed. |
| kubernetesnodesstatus | Unknown; Fail; Pass | Kubernetes is not installed. |
| kubernetescomponentsstatus | Unknown; Fail; Pass | Kubernetes is not installed. |
| Smart | Unknown; Fail; Pass | smartctl commands failed. |
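
Several Unknown results above come down to a command-line tool that is missing on the node. A quick way to test for that is sketched below. check_cmd is a hypothetical helper, and using kubectl as a stand-in for "Kubernetes is installed" is an assumption rather than the check the telemetry service itself performs.

```shell
# Report whether a command-line tool is on PATH.
check_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "not found: $1"
    fi
}

# dmesg -> dmesg metric, smartctl -> Smart metric,
# kubectl -> kubernetes* metrics (assumed proxy for a Kubernetes install).
for tool in dmesg smartctl kubectl; do
    check_cmd "$tool"
done
```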

GPU telemetry metrics

| Metric Name | Unit | Possible value(s) | Potential error cause(s) |
|---|---|---|---|
| gpu_temperature:gpu | C | Metric value; No data | AMD/NVIDIA/Intel accelerators are not present. GPU drivers are not installed. |
| gpu_utilization | percent | Metric value; No data | AMD/NVIDIA/Intel accelerators are not present. GPU drivers are not installed. |
| gpu_utilization:average | percent | Metric value; No data | AMD/NVIDIA/Intel accelerators are not present. GPU drivers are not installed. |
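
To check the "GPU drivers are not installed" cause on a node, one option is to look for a loaded GPU kernel module. The sketch below is illustrative only: gpu_driver_modules is a hypothetical name, it recognizes just the nvidia and amdgpu modules, and it does not cover Intel accelerators. A loaded module also does not prove that the driver is working.

```shell
# Read `lsmod` output on stdin and print the names of loaded GPU driver
# modules. Only the NVIDIA (nvidia) and AMD (amdgpu) modules are matched.
gpu_driver_modules() {
    awk '$1 == "nvidia" || $1 == "amdgpu" { print $1 }'
}

# Typical use on a cluster node:
#   lsmod | gpu_driver_modules
```

Empty output suggests that neither driver is loaded.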

Troubleshooting image download failures while executing local_repo.yml playbook

If you encounter image download failures while executing local_repo.yml, do the following to resolve the issue:

  1. Check whether the Docker pull limit has been reached by manually pulling an image. If it has, provide Docker credentials in provision_config_credentials.yml and re-run the local_repo.yml playbook. Alternatively, run nerdctl login manually.

  2. Run the following command:

    systemctl status nerdctl-registry
    

    Expected output:

    ../_images/image_failure_output_s2.png

    If the output differs, run:

    systemctl restart nerdctl-registry
    
  3. Run the following command:

    nerdctl ps -a | grep omnia-registry
    

    Expected output:

    ../_images/image_failure_output_s3.png

    If the output differs, run:

    systemctl restart nerdctl-registry
    
  4. Run the following command:

    curl -k https://<OIM_hostname>:5001/v2/_catalog
    

    Expected outputs:

    1. ../_images/image_failure_output_s4.png
    2. Empty list

    If neither output appears, do the following:

    1. Restart the OIM and check the curl command output again.

    2. Re-run local_repo.yml.

  5. Run the following command:

    openssl s_client -showcerts -connect <OIM_hostname>:5001
    

    Expected output:

    ../_images/image_failure_output_s5.png
    • Verify that the certificate is valid and that CN=private_registry.

    • The certificate shown in this command's output should match the certificate stored at /etc/containerd/certs.d/<OIM_hostname>:5001/ca.crt.

    If no certificate is visible on screen, run the following command:

    systemctl restart nerdctl-registry
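
Steps 4 and 5 can be partly automated. The sketch below is an illustration under assumptions, not an Omnia tool: catalog_ok and cert_fingerprint are hypothetical helper names, OIM_HOST is a placeholder variable, and the catalog check relies on the registry returning a JSON object with a "repositories" key, as the Docker Registry HTTP API v2 specifies. Comparing fingerprints only confirms a match when ca.crt holds the same certificate that the registry serves, for example a self-signed one.

```shell
# Return 0 if stdin looks like a registry catalog response, i.e. a JSON
# object with a "repositories" key (the list itself may be empty).
catalog_ok() {
    grep -q '"repositories"[[:space:]]*:'
}

# Print the SHA-256 fingerprint of the first PEM certificate on stdin.
cert_fingerprint() {
    openssl x509 -noout -fingerprint -sha256
}

# Usage on the OIM, with OIM_HOST set to the OIM hostname:
#   curl -sk "https://$OIM_HOST:5001/v2/_catalog" | catalog_ok && echo "registry answers"
#   openssl s_client -showcerts -connect "$OIM_HOST:5001" </dev/null 2>/dev/null | cert_fingerprint
#   cert_fingerprint < /path/to/ca.crt    # the ca.crt file named in step 5
```

If the two fingerprints differ, or the first command prints nothing, restarting nerdctl-registry as described above is the next thing to try.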
    

Troubleshooting task failures during omnia.yml playbook execution

If a task fails for any host in the inventory while the omnia.yml playbook is running, the failure can cascade and cause subsequent tasks in the playbook to fail as well.

In this scenario, troubleshoot the initial point of failure, that is, the first task that failed.
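
Locating that first failure in a long run is easier with a small filter. The sketch below assumes the playbook output was produced by Ansible's default output format, in which each task starts with a "TASK [...]" line and failures start with "fatal:". The function name first_failure and the log file name in the usage comment are hypothetical.

```shell
# Read Ansible playbook output on stdin and print the first failing task
# header followed by the first "fatal:" line reported under it.
first_failure() {
    awk '
        /^TASK \[/ { task = $0 }                   # remember the current task
        /^fatal: / { print task; print $0; exit }  # stop at the first failure
    '
}

# Usage, assuming the run was captured to a file (hypothetical name):
#   ansible-playbook omnia.yml -i inventory 2>&1 | tee omnia_run.log
#   first_failure < omnia_run.log
```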

If you have any feedback about Omnia documentation, please reach out at omnia.readme@dell.com.