vLLM enablement for clusters containing Intel Gaudi accelerators
Prerequisites
Before enabling the vLLM capabilities of the cluster running Intel Gaudi accelerators, the following prerequisites must be fulfilled:
Verify that the cluster nodes have sufficient allocatable resources for hugepages-2Mi and the Intel Gaudi accelerator. To check the allocatable resources on a node, run the following command, and repeat it for each Intel Gaudi node:
kubectl describe node <intel-gaudi-node-name> | grep -A 10 "Allocatable"
[Optional] If required, you can adjust the resource parameters in the vllm_configuration.yml file based on the availability of resources on the nodes.
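The grep in the first prerequisite prints the whole Allocatable block. As a minimal sketch, the helper below narrows that output to the two resources this section is concerned with. The function name allocatable_gaudi_resources is hypothetical, and the resource name habana.ai/gaudi is an assumption about how the Gaudi device plugin advertises the accelerators; confirm it against the describe output of your own nodes.

```shell
# Read `kubectl describe node` output on stdin and print only the
# hugepages-2Mi and Gaudi entries from the Allocatable section.
# Assumption: the accelerators are advertised as habana.ai/gaudi.
allocatable_gaudi_resources() {
  awk '
    /^Allocatable:/ { in_alloc = 1; next }     # the section starts here
    in_alloc && /^[^ \t]/ { in_alloc = 0 }     # the next top-level key ends it
    in_alloc && ($1 == "hugepages-2Mi:" || $1 == "habana.ai/gaudi:") { print $1, $2 }
  '
}

# Usage (needs cluster access):
#   kubectl describe node <intel-gaudi-node-name> | allocatable_gaudi_resources
```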
Deploy vLLM (Intel)
After you have completed all the prerequisites, do the following to deploy vLLM on a cluster running Intel Gaudi accelerators:
Create a namespace for the vLLM workloads on your kube_control_plane, according to the details provided in the vllm_configuration.yml file. Execute the following command:
kubectl create ns workloads
Verify that the namespace has been created by executing the following command:
kubectl get namespace workloads
Expected output:
NAME        STATUS   AGE
workloads   Active   45s
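Running kubectl create ns a second time fails because the namespace already exists. A minimal sketch of a re-runnable variant of the two commands above follows; ensure_namespace is a hypothetical helper name, not part of Omnia.

```shell
# Create the namespace only if it is not present yet, so the step can be repeated safely.
# ensure_namespace is a hypothetical helper name.
ensure_namespace() {
  ns="$1"
  if kubectl get namespace "$ns" >/dev/null 2>&1; then
    echo "namespace $ns already exists"
  else
    kubectl create namespace "$ns"
  fi
}

# Usage (needs cluster access):
#   ensure_namespace workloads
```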
To create a configuration file for vLLM deployment, follow these steps:
Locate the vllm_configuration.yml file in the examples/ai_examples/intel/vllm folder.
Open the vllm_configuration.yml file.
Add the necessary details, such as the Hugging Face token and the resources allocated for the vLLM deployment.
After modifying the file, you have two choices:
Directly copy the modified file to your kube_control_plane.
Create a new blank <vLLM_configuration_filename>.yml file, paste the modified contents into it, and save it on your kube_control_plane.
Finally, apply the file using the following command:
kubectl apply -f <vLLM_configuration_filename>.yml
Expected output:
service/vllm-llama-svc created
deployment.apps/vllm-llama created
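Because the configuration file is edited by hand, it can help to validate it before applying it. The sketch below uses kubectl's client-side dry run for that; apply_checked is a hypothetical helper name, and the approach is a suggestion rather than a step required by Omnia.

```shell
# Validate a manifest with a client-side dry run, then apply it only if that succeeds.
# apply_checked is a hypothetical helper name.
apply_checked() {
  manifest="$1"
  # --dry-run=client parses the file and reports what would be created
  # without persisting anything to the cluster.
  if kubectl apply --dry-run=client -f "$manifest" >/dev/null; then
    kubectl apply -f "$manifest"
  else
    echo "manifest $manifest failed validation; nothing applied" >&2
    return 1
  fi
}

# Usage (needs cluster access):
#   apply_checked <vLLM_configuration_filename>.yml
```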
To create and apply the Persistent Volume Claim (PVC) configuration file, required to access shared storage, follow these steps:
Create a new blank <PVC_filename>.yml file, paste the following content into it, and save it on your kube_control_plane.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-model
  namespace: workloads
spec:
  storageClassName: nfs-client
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: <storage-size>
Add the necessary details such as name, namespace, and storage size for the vLLM deployment. Use the same configurations as provided in the <vLLM_configuration_filename>.yml file.
Finally, apply the file using the following command:
kubectl apply -f <PVC_filename>.yml
Expected output:
persistentvolumeclaim/shared-model created
Verify the PVC is bound and available for the deployment using the following command:
kubectl get pvc -n workloads
Expected output:
NAME           STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
shared-model   Bound    pvc-0a066bce-9511-4f73-ac41-957a8088cfb0   400Gi      RWX            nfs-client     14s
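A claim can stay in the Pending state for a short while before the provisioner binds it. As a sketch, the helper below polls the claim's phase until it reads Bound. The name wait_for_pvc_bound and its default attempt count and delay are illustrative assumptions; the claim name and namespace default to the values used in this section.

```shell
# Poll a PVC until its phase is "Bound", or give up after a number of checks.
# wait_for_pvc_bound is a hypothetical helper; attempts and delay are illustrative defaults.
wait_for_pvc_bound() {
  pvc="${1:-shared-model}"
  namespace="${2:-workloads}"
  attempts="${3:-30}"
  delay="${4:-5}"          # seconds between checks
  i=0
  while [ "$i" -lt "$attempts" ]; do
    phase=$(kubectl get pvc "$pvc" -n "$namespace" -o jsonpath='{.status.phase}')
    if [ "$phase" = "Bound" ]; then
      echo "$pvc is Bound"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "$pvc is still ${phase:-unknown} after $attempts checks" >&2
  return 1
}

# Usage (needs cluster access):
#   wait_for_pvc_bound shared-model workloads
```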
After some time, check the status of the pods to verify that they are up and running. Execute the following command to get the pod status:
kubectl get pod -n workloads
Expected output (when pods are running):
NAME                          READY   STATUS    RESTARTS   AGE
vllm-llama-669bbf5c9b-1h7jm   1/1     Running   0          58s
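Instead of re-running the pod listing by hand, kubectl can block until the Deployment's pods are ready. The sketch below assumes the Deployment is named vllm-llama, as in the expected output of the apply step; wait_for_vllm_rollout is a hypothetical helper name and the 30m timeout is an illustrative default.

```shell
# Block until the Deployment has finished rolling out, or fail when the timeout expires.
# Assumes the Deployment name vllm-llama; the helper name and timeout are illustrative.
wait_for_vllm_rollout() {
  deployment="${1:-vllm-llama}"
  namespace="${2:-workloads}"
  timeout="${3:-30m}"
  kubectl rollout status "deployment/$deployment" -n "$namespace" --timeout="$timeout"
}

# Usage (needs cluster access):
#   wait_for_vllm_rollout vllm-llama workloads 30m
```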
After approximately 30 minutes, verify the service status of the vLLM deployment using the following command:
kubectl get svc -n workloads
Expected output:
NAME             TYPE       CLUSTER-IP      EXTERNAL-IP   PORT(S)          AGE
vllm-llama-svc   NodePort   10.233.13.108   <none>        8000:32195/TCP   71s
Finally, verify the endpoints using the following command:
kubectl get endpoints vllm-llama-svc -n workloads
Expected output:
NAME             ENDPOINTS             AGE
vllm-llama-svc   10.233.108.196:8000   82s
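The service is of type NodePort, so it is reached through a node IP and the second number in the PORT(S) column (32195 in the example output). As a sketch, the helper below reads that port from the service object rather than from the table. vllm_base_url is a hypothetical name, and the helper assumes the service exposes a single port.

```shell
# Print the base URL of the vLLM service for a given node IP.
# vllm_base_url is a hypothetical helper; it assumes the service has exactly one port.
vllm_base_url() {
  node_ip="$1"
  service="${2:-vllm-llama-svc}"
  namespace="${3:-workloads}"
  node_port=$(kubectl get svc "$service" -n "$namespace" \
    -o jsonpath='{.spec.ports[0].nodePort}')
  echo "http://${node_ip}:${node_port}"
}

# Usage (needs cluster access):
#   vllm_base_url <Intel_node_IP>
```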
Final output:
Once the vLLM deployment is complete, the following output is displayed when you execute the curl -X POST -d "param1=value1&param2=value2" <Intel_node_IP>:<port> command.
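The curl command above uses placeholder parameters. As a hedged sketch of a more concrete check, the helper below lists the models the server reports. It assumes the deployment runs vLLM's OpenAI-compatible server, which answers on the /v1/models path; if your vllm_configuration.yml starts a different entrypoint, the path will differ. vllm_list_models is a hypothetical helper name.

```shell
# Ask the vLLM server which models it is serving.
# Assumption: the deployment runs vLLM's OpenAI-compatible server (route /v1/models).
vllm_list_models() {
  base_url="$1"   # for example http://<Intel_node_IP>:<port>
  # -sS hides the progress bar but keeps errors; --fail turns HTTP errors into a non-zero exit.
  curl -sS --fail "${base_url}/v1/models"
}

# Usage (needs network access to the node):
#   vllm_list_models "http://<Intel_node_IP>:<port>"
```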