Kubernetes
⦾ Why do Kubernetes Pods show “ImagePullBackOff” or “ErrImagePull” errors in their status?
Potential Cause: The errors occur when the Docker pull limit is exceeded.
Resolution:
Ensure that the docker_username and docker_password are provided in input/provision_config_credentials.yml.
For an HPC cluster, during omnia.yml execution, a Kubernetes secret ‘dockerregcred’ is created in the default namespace and patched to the service account. When deploying custom applications, users need to patch this secret into their respective namespace and reference it under imagePullSecrets in the YAML file to avoid ErrImagePull. Click here for more info.
Note
If the playbook has already been executed and the pods are in the ImagePullBackOff state, run kubeadm reset -f on all the nodes before re-executing the playbook with the Docker credentials.
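As a minimal sketch of how a custom application can reference the secret, the helper below prints a Pod manifest with an imagePullSecrets entry. The function name print_pod_manifest and its two arguments are hypothetical and only serve as illustration. The sketch assumes the dockerregcred secret already exists in the namespace where the Pod is deployed.

```shell
# Print a Pod manifest that pulls its image using the dockerregcred secret.
# Arguments: <pod_name> <image> (both are illustrative placeholders).
print_pod_manifest() {
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: $1
spec:
  containers:
    - name: $1
      image: $2
  imagePullSecrets:
    - name: dockerregcred
EOF
}
```

The output can be piped to kubectl, for example: print_pod_manifest <pod_name> <image> | kubectl apply -n <namespace> -f -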
⦾ What to do if the nodes in a Kubernetes cluster reboot?
Resolution: Wait for 15 minutes after the Kubernetes cluster reboots. Next, verify the status of the cluster using the following commands:
Run kubectl get nodes on the kube_control_plane to get the real-time Kubernetes cluster status.
Run kubectl get pods --all-namespaces on the kube_control_plane to check which pods are in the Running state.
Run kubectl cluster-info on the kube_control_plane to verify that both the Kubernetes master and kubeDNS are in the Running state.
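The node check above can be repeated automatically instead of waiting a fixed time. The following is a sketch only: the function names count_not_ready_nodes and wait_for_nodes are hypothetical, and it assumes kubectl is configured on the kube_control_plane.

```shell
# Count nodes whose STATUS column does not start with "Ready".
count_not_ready_nodes() {
  kubectl get nodes --no-headers |
    awk '$2 !~ /^Ready/ { n++ } END { print n + 0 }'
}

# Poll until every node is Ready.
# Arguments: <max_attempts> <seconds_between_attempts>
# Returns 0 when all nodes are Ready, 1 if attempts run out.
wait_for_nodes() {
  tries=$1
  delay=$2
  while [ "$tries" -gt 0 ]; do
    [ "$(count_not_ready_nodes)" -eq 0 ] && return 0
    tries=$((tries - 1))
    sleep "$delay"
  done
  return 1
}
```

For example, wait_for_nodes 30 30 polls every 30 seconds for up to 15 minutes.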
⦾ What to do when the Kubernetes services are not in the “Running” state?
Resolution:
Run kubectl get pods --all-namespaces to verify that all pods are in the Running state.
If the pods are not in the Running state, delete the pods using the command: kubectl delete pods <name of pod>
Run the corresponding playbook that was used to install Kubernetes: omnia.yml, jupyterhub.yml, or kubeflow.yml.
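To find the pods that need attention, the listing can be filtered by status. This is a sketch, not part of Omnia: the function name list_unhealthy_pods is hypothetical, and it treats Completed pods as healthy, which is an assumption.

```shell
# List pods that are neither Running nor Completed,
# printed as "<namespace>/<pod_name> <status>".
list_unhealthy_pods() {
  kubectl get pods --all-namespaces --no-headers |
    awk '$4 != "Running" && $4 != "Completed" { print $1 "/" $2 " " $4 }'
}
```

Each pod reported can then be deleted with kubectl delete pods <name of pod> -n <namespace>.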
⦾ Why do Kubernetes Pods stop communicating with the servers when the DNS servers are not responding?
Potential Cause: The host network is faulty, causing DNS to be unresponsive.
Resolution:
In your Kubernetes cluster, run kubeadm reset -f on all the nodes.
On the management node, edit the omnia_config.yml file to change the Kubernetes Pod Network CIDR. The suggested IP range is 192.168.0.0/16. Ensure that the IP range provided is not in use on your host network.
List k8s in input/software_config.json and re-run omnia.yml.
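One way to check that a host address does not fall inside the chosen Pod Network CIDR is sketched below. The function names ip_to_int and ip_in_cidr are hypothetical helpers, and the sketch handles IPv4 only.

```shell
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  old_ifs=$IFS
  IFS=.
  set -- $1
  IFS=$old_ifs
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# Succeed (return 0) if <ip> lies inside <network>/<prefix>.
# Usage: ip_in_cidr <ip> <network>/<prefix>
ip_in_cidr() {
  ip=$(ip_to_int "$1")
  net=$(ip_to_int "${2%/*}")
  prefix=${2#*/}
  mask=$(( (0xFFFFFFFF << (32 - $prefix)) & 0xFFFFFFFF ))
  [ $(( ip & mask )) -eq $(( net & mask )) ]
}
```

If ip_in_cidr <host_ip> 192.168.0.0/16 succeeds for any address on the host network, the suggested range overlaps and a different CIDR should be chosen.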
⦾ Why does the ‘Initialize Kubeadm’ task fail with ‘nodeRegistration.name: Invalid value: "<Host name>"’?
Potential Cause: The OIM does not support hostnames that contain an underscore, such as ‘mgmt_station’.
Resolution: Use a hostname that contains only legal characters. As defined in RFC 952 and RFC 1123, the only legal characters are the following:
Alphanumeric (a-z and 0-9): Both uppercase and lowercase letters are acceptable, and the hostname is not case-sensitive. In other words, omnia.test is identical to OMNIA.TEST and Omnia.test.
Hyphen (-): Neither the first nor the last character in a hostname field should be a hyphen.
Period (.): The period should be used only to delimit fields in a hostname (for example, dvader.empire.gov).
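The rules above can be checked before running the playbook. The sketch below encodes only the three rules listed here; the function name is_valid_hostname is hypothetical, and it does not check length limits.

```shell
# Succeed (return 0) if the hostname uses only letters, digits, hyphens,
# and periods, with no field starting or ending in a hyphen.
is_valid_hostname() {
  printf '%s\n' "$1" |
    grep -Eq '^[A-Za-z0-9]([A-Za-z0-9-]*[A-Za-z0-9])?(\.[A-Za-z0-9]([A-Za-z0-9-]*[A-Za-z0-9])?)*$'
}
```

For example, is_valid_hostname mgmt_station fails because of the underscore, while is_valid_hostname mgmt-station succeeds.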
⦾ Why does the omnia.yml or scheduler.yml fail at TASK [kubernetes_sigs.kubespray.kubernetes-apps/metallb : MetalLB | Create address pools configuration] ?
Potential Cause: The error occurs when the MetalLB controller pods do not reach the Running state. This failure is caused by an issue with Kubespray, a third-party software. For more information about this issue, click here.
Resolution:
Check the status of the MetalLB controller pods using: kubectl get pods -n metallb-system
Once the pods come to the Running state, do the following:
Create an ipaddrespool.yml file with the below contents on the kube_control_plane. Refer to input/omnia_config.yml and update the <pod_external_ip_range> value:
    apiVersion: metallb.io/v1beta1
    kind: IPAddressPool
    metadata:
      namespace: "metallb-system"
      name: "primary"
    spec:
      addresses:
        - "<pod_external_ip_range>"
      autoAssign: True
      avoidBuggyIPs: False
Create an l2advertisement.yml file with the below contents on the kube_control_plane:
    apiVersion: metallb.io/v1beta1
    kind: L2Advertisement
    metadata:
      name: "primary"
      namespace: "metallb-system"
    spec:
      ipAddressPools:
        - "primary"
Run the following command on the kube_control_plane: kubectl get svc -A
Run the following command on the kube_control_plane: kubectl edit svc <svc_name> -n <namespace>
Run the following command on the kube_control_plane: kubectl delete -f /etc/kubernetes/metallb.yaml
Note
The metallb.yaml file is available in the /etc/kubernetes directory on the kube_control_plane.
Apply the following manifests strictly in the same order:
    kubectl apply -f /etc/kubernetes/metallb.yaml
    kubectl apply -f l2advertisement.yml
    kubectl apply -f ipaddrespool.yml
Finally, use the following command to open the definition file and change the type from ClusterIP to LoadBalancer: kubectl edit svc <svc_name> -n <namespace>
After this workaround, re-run the omnia.yml or scheduler.yml playbook.
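To avoid typing the address pool manifest by hand, it can be generated from the external IP range. This is a sketch only: the function name print_ipaddresspool is hypothetical, and the range passed in must be the <pod_external_ip_range> value from input/omnia_config.yml.

```shell
# Print the IPAddressPool manifest for the given external IP range.
# Argument: <pod_external_ip_range> as set in input/omnia_config.yml
print_ipaddresspool() {
  cat <<EOF
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  namespace: "metallb-system"
  name: "primary"
spec:
  addresses:
    - "$1"
  autoAssign: True
  avoidBuggyIPs: False
EOF
}
```

For example: print_ipaddresspool "<pod_external_ip_range>" > ipaddrespool.yml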
⦾ Why does the omnia.yml or scheduler.yml playbook execution fail with an Unable to retrieve file contents error?
Potential Cause: This error occurs when the Kubespray collection is not installed during the execution of prepare_oim.yml.
Resolution: Re-run prepare_oim.yml.
⦾ Why does the NFS-client provisioner go to a “ContainerCreating” or “CrashLoopBackOff” state?
Potential Cause: This issue usually occurs when server_share_path given in storage_config.yml for k8s_share does not have an NFS server running.
Resolution:
Ensure that storage.yml is executed on the same inventory that is used for scheduler.yml.
Ensure that the server_share_path mentioned in storage_config.yml for k8s_share: true has an active NFS server running on it.
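Whether the share is actually exported can be checked from a cluster node. The sketch below assumes the showmount utility (part of the NFS client tools) is installed; the function name nfs_export_exists is hypothetical.

```shell
# Succeed (return 0) if <nfs_server> exports exactly <share_path>.
# Usage: nfs_export_exists <nfs_server> <share_path>
nfs_export_exists() {
  # Skip the "Export list for ..." header line and compare the path column.
  showmount -e "$1" | awk 'NR > 1 { print $1 }' | grep -Fxq "$2"
}
```

For example, nfs_export_exists <nfs_server_ip> <server_share_path> fails when no NFS server is exporting the path given in storage_config.yml.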
⦾ If the NFS-client provisioner is in “ContainerCreating” or “CrashLoopBackOff” state, why does the kubectl describe <pod_name> command show the following output?
Potential Cause: This is a known issue. For more information, click here.
Resolution:
Wait for some time for the pods to come up, or do the following:
Run the following command to delete the pod: kubectl delete pod <pod_name> -n <namespace>
After deletion, the pod is restarted and comes to the Running state.
⦾ Why do the nvidia-device-plugin pods in ContainerCreating status fail with a no runtime for "nvidia" is configured error?
Potential Cause: nvidia-container-toolkit is not installed on GPU nodes.
Resolution: Install Kubernetes, download nvidia-container-toolkit, and perform the necessary configurations based on the OS running on the cluster.
⦾ After running the reset_cluster_configuration.yml playbook on a Kubernetes cluster, which should ideally delete all Kubernetes services and files, it is observed that some Kubernetes logs and configuration files are still present on the kube_control_plane. However, these left-over files do not cause any issues for Kubernetes re-installation on the cluster. The files are present under the following directories:
/var/log/containers/
/sys/fs/cgroup
/etc/system
/run/systemd/transient/
/tmp/releases
Potential Cause: When reset_cluster_configuration.yml is executed on a Kubernetes cluster, it triggers the Kubespray playbook kubernetes_sigs.kubespray.reset internally, which is responsible for removing Kubernetes configuration and services from the cluster. However, this Kubespray playbook doesn’t delete all Kubernetes services and files, resulting in some files being left behind on the kube_control_plane.
Workaround: After running the reset_cluster_configuration.yml playbook on a Kubernetes cluster, users can choose to remove the files from the directories mentioned above if they wish to do so.
⦾ Why are Kubernetes services not accessible?
Potential Cause: When firewalld is enabled on compute nodes, it blocks incoming traffic unless the appropriate ports are explicitly opened. This can prevent access to services exposed by Kubernetes (such as those using NodePort, LoadBalancer, or Ingress).
Resolution: You need to manually open the required firewalld ports in order to allow traffic through the ports used by the Kubernetes services. Perform the following steps:
Open the TCP/UDP ports manually.
For TCP ports, use the following command:
sudo firewall-cmd --permanent --add-port=<port_number>/tcp
For UDP ports, use the following command:
sudo firewall-cmd --permanent --add-port=<port_number>/udp
Reload the firewalld service using the below command to apply the changes.
sudo firewall-cmd --reload
Try accessing the service again. Ensure that the correct ports are open and the service is running. To know more about the ports, click here.
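When several ports need to be opened, the steps above can be wrapped in a small helper. This is a sketch under the assumption that sudo and firewall-cmd are available on the node; the function name open_ports is hypothetical.

```shell
# Open one or more ports permanently for a protocol, then reload firewalld.
# Usage: open_ports <tcp|udp> <port_number> [<port_number> ...]
open_ports() {
  proto=$1
  shift
  for port in "$@"; do
    sudo firewall-cmd --permanent --add-port="${port}/${proto}"
  done
  # Reload once at the end so all permanent rules take effect.
  sudo firewall-cmd --reload
}
```

For example: open_ports tcp <port_number> <port_number>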
If you have any feedback about Omnia documentation, please reach out at omnia.readme@dell.com.