Troubleshooting guide

Troubleshooting Kubeadm

For a complete guide to troubleshooting kubeadm, click here.

Connecting to internal databases

TimescaleDB
- Start a bash session within the timescaledb pod using the kubectl exec -it pod/timescaledb-0 -n telemetry-and-visualizations -- /bin/bash command.
- Connect to psql using the psql -U <postgres_username> command.
- Connect to the database using the \c telemetry_metrics command.
MySQL DB
- Start a bash session within the mysqldb pod using the kubectl exec -it pod/mysqldb-0 -n telemetry-and-visualizations -- /bin/bash command.
- Connect to mysql using the mysql -u <mysqldb_username> -p command and provide the password when prompted.
- Connect to the database using the USE idrac_telemetrysource_services_db command.
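
For quick checks, the interactive steps above can be collapsed into one-shot commands. The following is a minimal sketch, not an Omnia-provided tool: the function names `timescale_query` and `mysql_query` are hypothetical, and it assumes the pod, namespace, and database names listed above and that the clients accept connections from inside their own pods.

```shell
# Namespace and pod names are taken from the steps above.
NAMESPACE="telemetry-and-visualizations"

# Run a single statement against TimescaleDB without opening a bash session.
# Usage: timescale_query <postgres_username> "<sql or backslash command>"
timescale_query() {
  kubectl exec pod/timescaledb-0 -n "$NAMESPACE" -- \
    psql -U "$1" -d telemetry_metrics -c "$2"
}

# Run a single statement against MySQL. The -p flag makes the client prompt
# for the password, so a TTY (-it) is requested.
# Usage: mysql_query <mysqldb_username> "<sql>"
mysql_query() {
  kubectl exec -it pod/mysqldb-0 -n "$NAMESPACE" -- \
    mysql -u "$1" -p idrac_telemetrysource_services_db -e "$2"
}
```

For example, `timescale_query <postgres_username> '\dt'` would list the tables in telemetry_metrics, and `mysql_query <mysqldb_username> 'SHOW TABLES;'` would do the same for the MySQL database.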

Checking and updating encrypted parameters

Change to the directory where the parameters are saved (this example uses provision_config_credentials.yml):
```
cd input/
```

To view the encrypted parameters:

ansible-vault view provision_config_credentials.yml --vault-password-file .provision_credential_vault_key

To edit the encrypted parameters:

ansible-vault edit provision_config_credentials.yml --vault-password-file .provision_credential_vault_key
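
Before viewing or editing, it can help to confirm that the file is actually encrypted, since Ansible Vault files begin with a `$ANSIBLE_VAULT;` header line. The sketch below is illustrative only; the function names `is_vault_encrypted` and `show_parameters` are hypothetical helpers, not part of Omnia or Ansible.

```shell
# Succeeds if the first line of the file carries the Ansible Vault header.
# Usage: is_vault_encrypted <file>
is_vault_encrypted() {
  head -n 1 "$1" 2>/dev/null | grep -q '^\$ANSIBLE_VAULT;'
}

# View the file through ansible-vault only when it is encrypted;
# otherwise print a note on stderr and show the plain-text contents.
# Usage: show_parameters <file> <vault_password_file>
show_parameters() {
  if is_vault_encrypted "$1"; then
    ansible-vault view "$1" --vault-password-file "$2"
  else
    echo "Note: $1 is not vault-encrypted" >&2
    cat "$1"
  fi
}
```

From the input/ directory, this could be called as `show_parameters provision_config_credentials.yml .provision_credential_vault_key`.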

Checking pod status from the OIM

Use this command to get a list of all available pods: kubectl get pods -A

Check the status of any specific pod by running: kubectl describe pod <pod name> -n <namespace name>
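
On a busy cluster the full pod list can be long. As a rough sketch, the output can be narrowed to pods that are neither Running nor Completed; the function names below are hypothetical, and the filter relies on the default column order of `kubectl get pods -A` (NAMESPACE, NAME, READY, STATUS, RESTARTS, AGE).

```shell
# Print only the pods whose STATUS column is neither Running nor Completed.
# Expects the output of "kubectl get pods -A --no-headers" on stdin.
filter_unhealthy_pods() {
  awk '$4 != "Running" && $4 != "Completed" { print $1 "/" $2 ": " $4 }'
}

# List unhealthy pods across all namespaces.
list_unhealthy_pods() {
  kubectl get pods -A --no-headers | filter_unhealthy_pods
}
```

Each line printed by `list_unhealthy_pods` gives a namespace and pod name that can be passed to kubectl describe pod. Note that a pod can be Running without being ready, so this filter is a starting point rather than a complete health check.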

Using telemetry information to diagnose node issues

Regular telemetry metrics
Metric Name	Unit	Possible values	Possible error causes
BlockedProcesses	processes	Metric Value, No Data	This could happen if the `/proc/stat` file is inaccessible.
CPUSystem	seconds	Metric Value, No Data	This could happen when the `psutil` library encounters errors.
CPUWait	seconds	Metric Value, No Data	This could happen when the `psutil` library encounters errors.
ErrorsRecv		Metric Value, No Data	This could happen when the `psutil` library encounters errors.
ErrorsSent		Metric Value, No Data	This could happen when the `psutil` library encounters errors.
FailedJobs		Metric Value, No Data	Slurm is not installed.
HardwareCorruptedMemory	kB	Metric Value, No Data	This could happen if the `/proc/meminfo` file is inaccessible.
MemoryActive	bytes	Metric Value, No Data	This could happen when the `psutil` library encounters errors.
MemoryAvailable	bytes	Metric Value, No Data	This could happen when the `psutil` library encounters errors.
MemoryCached	bytes	Metric Value, No Data	This could happen when the `psutil` library encounters errors.
MemoryFree	bytes	Metric Value, No Data	This could happen when the `psutil` library encounters errors.
MemoryInactive	bytes	Metric Value, No Data	This could happen when the `psutil` library encounters errors.
MemoryPercent	percent	Metric Value, No Data	This could happen when the `psutil` library encounters errors.
MemoryShared	bytes	Metric Value, No Data	This could happen when the `psutil` library encounters errors.
MemoryTotal	bytes	Metric Value, No Data	This could happen when the `psutil` library encounters errors.
MemoryUsed	bytes	Metric Value, No Data	This could happen when the `psutil` library encounters errors.
NodesDown		Metric Value, No Data	Slurm is not installed.
NodesTotal		Metric Value, No Data	Slurm is not installed.
NodesUp		Metric Value, No Data	Slurm is not installed.
QueuedJobs		Metric Value, No Data	Slurm is not installed.
RunningJobs		Metric Value, No Data	Slurm is not installed.
SMARTHDATemp	C	Metric Value, No Data	`smartctl` commands failed.
UniqueUserLogin		Metric Value, No Data
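
When a metric reports No Data, the causes listed in the table can be checked directly on the affected node. The following is a minimal sketch under stated assumptions: all function names are hypothetical, `sinfo` being on PATH is used as a stand-in for "Slurm is installed", and the `python3` on PATH is assumed to be the interpreter the telemetry collector uses.

```shell
# Report whether a file is readable by the current user.
check_readable() {
  if [ -r "$1" ]; then echo "OK: $1 is readable"; else echo "MISSING: $1 is not readable"; fi
}

# Report whether a command is available on PATH.
check_command() {
  if command -v "$1" >/dev/null 2>&1; then echo "OK: $1 found"; else echo "MISSING: $1 not found"; fi
}

# Report whether the psutil Python module can be imported.
check_psutil() {
  if python3 -c 'import psutil' >/dev/null 2>&1; then
    echo "OK: psutil importable"
  else
    echo "MISSING: psutil not importable"
  fi
}

# Run the checks that correspond to the error causes in the table.
check_regular_telemetry_prereqs() {
  check_readable /proc/stat
  check_readable /proc/meminfo
  check_psutil
  check_command sinfo      # assumption: sinfo on PATH indicates a Slurm install
  check_command smartctl
}
```

Running `check_regular_telemetry_prereqs` on the node prints one OK or MISSING line per prerequisite. A passing check does not guarantee that the metric will be collected, for example `smartctl` may be present but still fail on a given disk.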

Health telemetry metrics
Metric Name	Possible value(s)	Possible failure causes
dmesg	Unknown, Fail, Pass	[Unknown] The dmesg command was not found on the cluster node. [Fail] The dmesg command returned an error log message.
beegfs -beegfsstat	Unknown, Fail, Pass	[Unknown] BeeGFS is not installed or inactive. [Fail] The BeeGFS client service has failed or the node is not present in reachable lists of BeeGFS clients.
gpu_driver_health:gpu	Unknown, Fail, Pass	AMD/NVIDIA/Intel accelerators are not present. GPU drivers are not installed.
gpu_health_nvlink:gpu [1]	Unknown, Fail, Pass	AMD/NVIDIA/Intel accelerators are not present. NVLinks are not present. GPU drivers are not installed.
gpu_health_pcie:gpu	Unknown, Fail, Pass	AMD/NVIDIA/Intel accelerators are not present. GPU drivers are not installed.
gpu_health_pmu:gpu	Unknown, Fail, Pass	AMD/NVIDIA/Intel accelerators are not present. GPU drivers are not installed.
gpu_health_power:gpu	Unknown, Fail, Pass	AMD/NVIDIA/Intel accelerators are not present. GPU drivers are not installed.
gpu_health_thermal:gpu	Unknown, Metric Value	AMD/NVIDIA/Intel accelerators are not present. GPU drivers are not installed.
Kubernetespodsstatus	Unknown, Fail, Pass	Kubernetes is not installed.
Kuberneteschildnode	Unknown, Fail, Pass	Kubernetes is not installed.
kubernetesnodesstatus	Unknown, Fail, Pass	Kubernetes is not installed.
kubernetescomponentsstatus	Unknown, Fail, Pass	Kubernetes is not installed.
Smart	Unknown, Fail, Pass	smartctl commands failed.

GPU telemetry metrics
Metric Name	Unit	Possible value(s)	Potential error cause(s)
gpu_temperature:gpu	C	Metric value, No data	AMD/NVIDIA/Intel accelerators are not present. GPU drivers are not installed.
gpu_utilization	percent	Metric value, No data	AMD/NVIDIA/Intel accelerators are not present. GPU drivers are not installed.
gpu_utilization:average	percent	Metric value, No data	AMD/NVIDIA/Intel accelerators are not present. GPU drivers are not installed.
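
Both error causes in the GPU tables (no accelerator present, no driver installed) can be checked from the node itself. The sketch below is only an approximation: the function names are hypothetical, the filter looks for display-class PCI devices in `lspci` output and may miss accelerators that report a different PCI class, and the presence of `nvidia-smi` or `rocm-smi` is assumed to be a reasonable proxy for an installed NVIDIA or AMD driver stack.

```shell
# Filter "lspci" output (on stdin) down to NVIDIA, AMD, or Intel display devices.
filter_accelerators() {
  grep -E '(VGA compatible|3D|Display) controller' |
    grep -E -i 'nvidia|amd|advanced micro devices|intel'
}

# Report whether a vendor management tool is on PATH.
# Assumption: these tools are installed together with the vendor drivers.
check_gpu_tools() {
  for tool in nvidia-smi rocm-smi; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "OK: $tool found"
    else
      echo "MISSING: $tool not found"
    fi
  done
}
```

On a node, `lspci | filter_accelerators` prints nothing when no matching device is detected, which corresponds to the first cause in the table; `check_gpu_tools` then gives a hint about the second.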

Troubleshooting image download failures while executing local_repo.yml playbook

If you encounter image download failures while executing local_repo.yml, do the following to resolve the issue:

Check whether the Docker pull limit has been reached by manually pulling an image. If it has, provide Docker credentials in provision_config_credentials.yml and re-run the local_repo.yml playbook. Alternatively, run nerdctl login manually.
Run the following command:
systemctl status nerdctl-registry
Expected output: the nerdctl-registry service is shown as active (running).

If it is not, run:
systemctl restart nerdctl-registry
Run the following command:
nerdctl ps -a | grep omnia-registry
Expected output: a line showing the omnia-registry container with status Up.

If it is not, run:
systemctl restart nerdctl-registry
Run the following command:
curl -k https://<OIM_hostname>:5001/v2/_catalog
Expected output: a JSON list of repositories, which may be an empty list.

Otherwise, do the following:

Restart the OIM and check curl command output again.

Re-run local_repo.yml.
Run the following command:
openssl s_client -showcerts -connect <OIM_hostname>:5001
Expected output:

Verify that the certificate is valid and CN=private_registry.

The certificate shown in this command's output should match the contents of /etc/containerd/certs.d/<OIM_hostname>:5001/ca.crt.

If no certificate is visible on screen, run the following command:
systemctl restart nerdctl-registry
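
The registry checks above can be run in sequence with a small script. This is a sketch under stated assumptions, not an Omnia utility: the function names `check_registry` and `compare_registry_cert` are hypothetical, port 5001 is taken from the commands above, and the fingerprint comparison only makes sense if the registry serves the same self-signed certificate that is stored in ca.crt.

```shell
# Run the registry checks in order and stop at the first one that fails.
# Usage: check_registry <OIM_hostname>
check_registry() {
  host="$1"
  if ! systemctl is-active --quiet nerdctl-registry; then
    echo "nerdctl-registry service is not active; try: systemctl restart nerdctl-registry"
    return 1
  fi
  if ! nerdctl ps -a | grep -q omnia-registry; then
    echo "omnia-registry container not found; try: systemctl restart nerdctl-registry"
    return 1
  fi
  if ! curl -k -s -f "https://${host}:5001/v2/_catalog" >/dev/null; then
    echo "registry catalog not reachable at ${host}:5001"
    return 1
  fi
  echo "registry checks passed"
}

# Compare SHA-256 fingerprints of the certificate served by the registry and
# the certificate file trusted by containerd. Succeeds only if they match.
# Usage: compare_registry_cert <OIM_hostname> <path_to_ca.crt>
compare_registry_cert() {
  served=$(openssl s_client -showcerts -connect "$1:5001" </dev/null 2>/dev/null |
    openssl x509 -noout -fingerprint -sha256)
  local_fp=$(openssl x509 -in "$2" -noout -fingerprint -sha256)
  [ -n "$served" ] && [ "$served" = "$local_fp" ]
}
```

Calling `check_registry <OIM_hostname>` prints the first failing check together with the suggested restart command, and `compare_registry_cert <OIM_hostname> <path_to_ca.crt>` returns a non-zero exit status when no certificate is served or the fingerprints differ.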

Troubleshooting task failures during omnia.yml playbook execution

If a task fails for any host listed in the inventory during the execution of the omnia.yml playbook, the failure can cascade and cause subsequent tasks in the playbook to fail as well.

In this scenario, troubleshoot the initial point of failure, that is, the first task that failed.
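
Finding the first failed task in a long playbook run can be tedious. As a rough sketch, assuming the run's console output was saved to a file and uses Ansible's default output format (`TASK [...]` headers and `fatal: [host]` lines), a small filter can extract it. The function name `first_failed_task` and the log file name used below are hypothetical.

```shell
# Print the first failed task and the host line that reported it.
# Expects Ansible playbook output on stdin; prints nothing if no task failed.
first_failed_task() {
  awk '
    # Remember the most recent task header, without its trailing asterisks.
    /^TASK \[/ { task = $0; sub(/ \*+$/, "", task) }
    # Stop at the first failure and print the task it belongs to.
    /^fatal: \[/ { print task; print $0; exit }
  '
}
```

For example, if the playbook output was captured in a file named omnia_run.log, `first_failed_task < omnia_run.log` prints the task header followed by the first `fatal:` line, which identifies both the task and the host to investigate.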

If you have any feedback about Omnia documentation, please reach out at omnia.readme@dell.com.