Set up DeepSpeed for clusters containing Intel Gaudi accelerators

DeepSpeed is a deep learning optimization library developed by Microsoft, designed to make training large-scale machine learning models more efficient and scalable. It provides several key features that help accelerate training and reduce the resource requirements for training state-of-the-art models.

Prerequisites

Before deploying a DeepSpeed MPIJob, the following prerequisites must be fulfilled:

  1. Kubeflow must be deployed on all the cluster nodes. For details, see the Kubeflow deployment instructions.

  2. The mpi-operator package must be configured to use the v2beta1 API. For details, see the mpi-operator configuration instructions.

  3. Verify that the cluster nodes have sufficient allocatable resources for hugepages-2Mi and the Intel Gaudi accelerators. To check the allocatable resources, run the following command for each Intel Gaudi node:

    kubectl describe node <intel-gaudi-node-name> | grep -A 10 "Allocatable"
    
  4. [Optional] Adjust the resource parameters in the ds_configuration.yml file based on the resources available on the nodes.
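
The resource check in step 3 can be narrowed to the two entries that matter here. The following is a minimal sketch, not part of Omnia: allocatable_summary is a hypothetical helper name, and it assumes that the Intel Gaudi device plugin advertises the accelerators under the resource name habana.ai/gaudi. Adjust the name if your cluster reports a different one.

```shell
# Hypothetical helper: read `kubectl describe node` output on stdin and print
# only the hugepages-2Mi and Gaudi entries from the "Allocatable" section.
# Assumption: the accelerator resource is named habana.ai/gaudi.
allocatable_summary() {
  awk '
    # The Allocatable header opens the section of interest.
    /^Allocatable:/          { in_section = 1; next }
    # Any later line that starts in column one is a new top-level section.
    in_section && /^[^ \t]/  { in_section = 0 }
    # Inside the section, keep only the two resources named above.
    in_section && ($1 == "hugepages-2Mi:" || $1 == "habana.ai/gaudi:") { print $1, $2 }
  '
}

# Usage (requires cluster access):
#   kubectl describe node <intel-gaudi-node-name> | allocatable_summary
```

Matching on the section header rather than a fixed line count (grep -A 10) avoids missing entries on nodes that list more than ten allocatable resources.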

Deploy DeepSpeed

After you have completed all the prerequisites, do the following to deploy a DeepSpeed MPIJob:

  1. Create a namespace to manage all your DeepSpeed workloads. Execute the following command:

    kubectl create ns workloads
    
  2. Verify that the namespace has been created by executing the following command:

    kubectl get namespace workloads
    

    Expected output:

    NAME        STATUS  AGE
    workloads   Active  14s
    
  3. To create and apply the DeepSpeed configuration file, follow these steps:

    1. Locate the ds_configuration.yml file in the examples/ai_examples/intel/deepSpeed/ folder.

    2. Open the ds_configuration.yml file.

    3. Add the necessary details such as proxy settings, Hugging Face token, and allocated resources for the DeepSpeed MPIJob.

    4. After modifying the file, you have two choices:

      • Directly copy the modified file to your kube_control_plane.

      • Create a new blank <DeepSpeed_configuration_filename>.yml file, paste the modified contents into it, and save it on your kube_control_plane.

    5. Finally, apply the file using the following command:

      kubectl apply -f <DeepSpeed_configuration_filename>.yml
      

    Expected output:

    mpijob.kubeflow.org/gaudi-llm-ds-ft created
    
  4. To create and apply the Persistent Volume Claim (PVC) configuration file, required to access shared storage, follow these steps:

    1. Create a new blank <PVC_filename>.yml file.

    2. Paste the following content into it, and save it on your kube_control_plane.

      apiVersion: v1
      kind: PersistentVolumeClaim
      metadata:
        name: shared-model
        namespace: workloads
      spec:
        storageClassName: nfs-client
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: <storage-size>
      
    3. Set the name, namespace, and storage size for the DeepSpeed MPIJob. These values must match the ones used in the <DeepSpeed_configuration_filename>.yml file.

    4. Finally, apply the file using the following command:

      kubectl apply -f <PVC_filename>.yml
      

    Expected output:

    persistentvolumeclaim/shared-model created
    
  5. After some time, check the status of the pods to verify that they are up and running. Execute the following command:

    kubectl get pod -n workloads
    

    Expected output (when pods are running):

    NAME                             READY  STATUS    RESTARTS  AGE
    gaudi-llm-ds-ft-launcher-zfnls   1/1    Running   0         33s
    gaudi-llm-ds-ft-worker-0         1/1    Running   0         33s
    
  6. [Optional] To better understand the MPIJob resource, you can use the following command:

    kubectl explain mpijob --api-version=kubeflow.org/v2beta1
    

    Expected output:

    GROUP: kubeflow.org
    KIND: MPIJob
    VERSION: v2beta1
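
The PVC file from step 4 can also be generated rather than pasted by hand, which keeps the name and namespace consistent with the MPIJob configuration. The following is a minimal sketch under stated assumptions: render_pvc is a hypothetical helper, and the 50Gi size in the usage lines is a placeholder, not a recommended value. The storage class and access mode are copied from the manifest shown in step 4.

```shell
# Hypothetical helper: print a PVC manifest to stdout.
#   $1 = PVC name, $2 = namespace, $3 = storage size (for example 50Gi)
render_pvc() {
  pvc_name=$1
  pvc_namespace=$2
  pvc_size=$3
  cat <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ${pvc_name}
  namespace: ${pvc_namespace}
spec:
  storageClassName: nfs-client
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: ${pvc_size}
EOF
}

# Usage (the size is a placeholder; the apply step requires cluster access):
#   render_pvc shared-model workloads 50Gi > <PVC_filename>.yml
#   kubectl apply -f <PVC_filename>.yml
```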
    

Final output:

Once the DeepSpeed deployment is complete (after approximately 30 minutes), checking the status of the pods with the kubectl get pod -n workloads command displays the following output. The launcher pod (gaudi-llm-ds-ft-launcher-zfnls) moves to the Completed status, and the worker pod (gaudi-llm-ds-ft-worker-0) is no longer listed because it has exited successfully, which indicates a successful deployment:

NAME                             READY  STATUS      RESTARTS  AGE
gaudi-llm-ds-ft-launcher-zfnls   0/1    Completed   0         10h
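
Instead of rerunning the status command by hand, the launcher status can be extracted from the pod listing and polled. The following is a rough sketch, not an Omnia utility: launcher_status is a hypothetical helper that relies on the launcher pod name containing "-launcher-", as in the output above, and the 60-second interval is an arbitrary choice.

```shell
# Hypothetical helper: read `kubectl get pod` output on stdin and print the
# STATUS column of the first pod whose name contains "-launcher-".
launcher_status() {
  # NR > 1 skips the header row; the third column holds the pod status.
  awk 'NR > 1 && $1 ~ /-launcher-/ { print $3; exit }'
}

# Hypothetical polling loop (requires cluster access, so it is left commented out):
#   until [ "$(kubectl get pod -n workloads | launcher_status)" = "Completed" ]; do
#     sleep 60
#   done
```

The loop as written waits only for the Completed status. If the launcher pod can end in a failed state on your cluster, add a check for that status so the loop does not wait indefinitely.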

If you have any feedback about Omnia documentation, please reach out at omnia.readme@dell.com.