vLLM enablement for clusters containing Intel Gaudi accelerators

Prerequisites

Before enabling vLLM on a cluster running Intel Gaudi accelerators, fulfill the following prerequisites:

  1. Verify that the cluster nodes have sufficient allocatable hugepages-2Mi and Intel Gaudi accelerator resources. To check the allocatable resources, run the following command for each Intel Gaudi node:

    kubectl describe node <intel-gaudi-node-name> | grep -A 10 "Allocatable"
    
  2. [Optional] If required, you can adjust the resource parameters in the vllm_configuration.yml file based on the availability of resources on the nodes.
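
The per-node check above can also be done for all nodes in one call. The following is a minimal sketch, not part of Omnia: the function name list_gaudi_allocatable is hypothetical, and it assumes that the Gaudi device plugin advertises accelerators under the extended resource name habana.ai/gaudi. Nodes that do not expose a resource show <none> in that column.

```shell
# Sketch: list allocatable hugepages-2Mi and Gaudi devices for every node.
# Assumption: Gaudi devices are advertised as the extended resource
# "habana.ai/gaudi"; adjust GAUDI_RESOURCE if your cluster uses another name.
GAUDI_RESOURCE='habana\.ai/gaudi'   # dots are escaped for kubectl's JSONPath

list_gaudi_allocatable() {
  kubectl get nodes -o custom-columns="NODE:.metadata.name,HUGEPAGES_2MI:.status.allocatable.hugepages-2Mi,GAUDI:.status.allocatable.${GAUDI_RESOURCE}"
}
```

Call list_gaudi_allocatable from any host where kubectl is configured for the cluster, then compare the values with the resource requests in vllm_configuration.yml.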

Deploy vLLM (Intel)

After you have completed all the prerequisites, do the following to deploy vLLM on a cluster running Intel Gaudi accelerators:

  1. On your kube_control_plane, create the namespace specified in the vllm_configuration.yml file. Execute the following command:

    kubectl create ns workloads
    
  2. Verify that the namespace has been created by executing the following command:

    kubectl get namespace workloads
    

    Expected output:

    NAME        STATUS  AGE
    workloads   Active  45s
    
  3. To create a configuration file for vLLM deployment, follow these steps:

    1. Locate the vllm_configuration.yml file in the examples/ai_examples/intel/vllm folder.

    2. Open the vllm_configuration.yml file.

    3. Add the necessary details, such as the Hugging Face token and the resources allocated to the vLLM deployment.

    4. After modifying the file, you have two choices:

      • Directly copy the modified file to your kube_control_plane.

      • Create a new blank <vLLM_configuration_filename>.yml file, paste the modified contents into it, and save it on your kube_control_plane.

    5. Finally, apply the file using the following command:

      kubectl apply -f <vLLM_configuration_filename>.yml
      

    Expected output:

    service/vllm-llama-svc created
    deployment.apps/vllm-llama created
    
  4. To create and apply the Persistent Volume Claim (PVC) configuration file, required to access shared storage, follow these steps:

    1. Create a new blank <PVC_filename>.yml file.

    2. Paste the following content into it, and save it on your kube_control_plane:

      apiVersion: v1
      kind: PersistentVolumeClaim
      metadata:
        name: shared-model
        namespace: workloads
      spec:
        storageClassName: nfs-client
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: <storage-size>
      
    3. Set the name, namespace, and storage size for the vLLM deployment. Use the same values as in the <vLLM_configuration_filename>.yml file.

    4. Finally, apply the file using the following command:

      kubectl apply -f <PVC_filename>.yml
      

    Expected output:

    persistentvolumeclaim/shared-model created
    
  5. Verify that the PVC is bound and available for the deployment using the following command:

    kubectl get pvc -n workloads
    

    Expected output:

    NAME          STATUS  VOLUME                                     CAPACITY  ACCESS MODES  STORAGECLASS  AGE
    shared-model  Bound   pvc-0a066bce-9511-4f73-ac41-957a8088cfb0   400Gi     RWX           nfs-client    14s
    
  6. After some time, check the status of the pods to verify that they are up and running. Execute the following command to get the pod status:

    kubectl get pod -n workloads
    

    Expected output (when pods are running):

    NAME                             READY  STATUS    RESTARTS  AGE
    vllm-llama-669bbf5c9b-1h7jm      1/1    Running   0         58s
    
  7. After approximately 30 minutes, verify the service status of the vLLM deployment using the following command:

    kubectl get svc -n workloads
    

    Expected output:

    NAME            TYPE       CLUSTER-IP     EXTERNAL-IP  PORT(S)          AGE
    vllm-llama-svc  NodePort   10.233.13.108  <none>       8000:32195/TCP   71s
    
  8. Finally, verify the endpoints using the following command:

    kubectl get endpoints vllm-llama-svc -n workloads
    

    Expected output:

    NAME             ENDPOINTS               AGE
    vllm-llama-svc   10.233.108.196:8000     82s
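
Instead of re-running the status commands by hand, the verification steps can be scripted so that they block until each resource is ready. The following is a minimal sketch under stated assumptions: the function name wait_for_vllm is hypothetical, the timeouts are illustrative values, and the resource names (shared-model, vllm-llama, vllm-llama-svc) are the ones shown in the expected outputs above. The --for=jsonpath form of kubectl wait requires kubectl v1.23 or later.

```shell
# Sketch: block until the PVC is bound and the Deployment has rolled out,
# then print the service endpoints. Timeouts are illustrative assumptions.
NAMESPACE=workloads

wait_for_vllm() {
  # Each step runs only if the previous one succeeded.
  kubectl wait --for=jsonpath='{.status.phase}'=Bound \
    pvc/shared-model -n "$NAMESPACE" --timeout=120s &&
  kubectl rollout status deployment/vllm-llama -n "$NAMESPACE" --timeout=45m &&
  kubectl get endpoints vllm-llama-svc -n "$NAMESPACE"
}
```

Run wait_for_vllm after applying both configuration files. A non-zero exit status means that one of the resources did not become ready within its timeout.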
    

Final output:

Once the vLLM deployment is complete, run the following command against the service to view the final output:

    curl -X POST -d "param1=value1&param2=value2" <Intel_node_IP>:<port>
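
The <port> value is the NodePort of the vllm-llama-svc service (32195 in the sample service output). The following is a minimal sketch for looking it up rather than copying it by hand. The function name vllm_base_url is hypothetical, and the usage comment assumes that the pod runs vLLM's OpenAI-compatible server, which serves GET /v1/models. Confirm this against the command in vllm_configuration.yml.

```shell
# Sketch: build the base URL of the vLLM service from its NodePort.
# The node IP is supplied by the caller; the service name and namespace
# are the ones used in the deployment steps.
vllm_base_url() {
  node_ip=$1
  port=$(kubectl get svc vllm-llama-svc -n workloads \
    -o jsonpath='{.spec.ports[0].nodePort}') || return 1
  printf 'http://%s:%s\n' "$node_ip" "$port"
}

# Usage (assumption: the OpenAI-compatible server is running):
#   curl -s "$(vllm_base_url <Intel_node_IP>)/v1/models"
```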

If you have any feedback about Omnia documentation, please reach out at omnia.readme@dell.com.