Set up DeepSpeed for clusters containing Intel Gaudi accelerators
==================================================================

DeepSpeed is a deep learning optimization library developed by Microsoft, designed to make training large-scale machine learning models more efficient and scalable. It provides several key features that help accelerate training and reduce the resource requirements for training state-of-the-art models.

Prerequisites
--------------

Before deploying a DeepSpeed MPIJob, the following prerequisites must be fulfilled:

1. Kubeflow must be deployed on all the cluster nodes. Refer to the Kubeflow deployment documentation for details.

2. Configure the *mpi-operator* package to execute the v2beta1 API. Refer to the *mpi-operator* configuration documentation for details.

3. Verify that the cluster nodes have sufficient allocatable resources for ``hugepages-2Mi`` and ``Intel Gaudi accelerator``. To check the allocatable resources on all nodes, run: ::

       kubectl describe node | grep -A 10 "Allocatable"

4. [Optional] If required, adjust the resource parameters in the ``ds_configuration.yml`` file based on the resources available on the nodes.

Deploy DeepSpeed
-----------------

After you have completed all the prerequisites, do the following to deploy a DeepSpeed MPIJob:

1. Create a namespace to manage all your DeepSpeed workloads. Execute the following command: ::

       kubectl create ns workloads

2. Verify that the namespace has been created by executing the following command: ::

       kubectl get namespace workloads

   *Expected output*: ::

       NAME        STATUS   AGE
       workloads   Active   14s

3. To create and apply the DeepSpeed configuration file, follow these steps:

   a. Locate the ``ds_configuration.yml`` file in the ``examples/ai_examples/intel/deepSpeed/`` folder.

   b. Open the ``ds_configuration.yml`` file.

   c. Add the necessary details such as proxy settings, Hugging Face token, and allocated resources for the DeepSpeed MPIJob.

   d. After modifying the file, you have two choices:

      - Directly copy the modified file to your ``kube_control_plane``.
      - Create a new blank ``.yml`` file, paste the modified contents into it, and save it on your ``kube_control_plane``.

   e. Finally, apply the file using the following command: ::

          kubectl apply -f <file_name>.yml

      *Expected output*: ::

          mpijob.kubeflow.org/gaudi-llm-ds-ft created

4. To create and apply the Persistent Volume Claim (PVC) configuration file, which is required to access shared storage, follow these steps:

   a. Create a new blank ``.yml`` file.

   b. Paste the following content into it, and save it on your ``kube_control_plane``. ::

          apiVersion: v1
          kind: PersistentVolumeClaim
          metadata:
            name: shared-model
            namespace: workloads
          spec:
            storageClassName: nfs-client
            accessModes:
              - ReadWriteOnce
            resources:
              requests:
                storage: <storage_size>

   c. Add the necessary details such as name, namespace, and storage size for the DeepSpeed MPIJobs. Use the same values as in the DeepSpeed configuration file (``ds_configuration.yml``).

   d. Finally, apply the file using the following command: ::

          kubectl apply -f <file_name>.yml

      *Expected output*: ::

          persistentvolumeclaim/shared-model created

5. After some time, check the status of the pods to verify that they are up and running. Execute the following command to get the pod status: ::

       kubectl get pod -n workloads

   *Expected output (when pods are running)*: ::

       NAME                             READY   STATUS    RESTARTS   AGE
       gaudi-llm-ds-ft-launcher-zfnls   1/1     Running   0          33s
       gaudi-llm-ds-ft-worker-0         1/1     Running   0          33s

6. [Optional] To better understand the MPIJob resource, you can use the following command: ::

       kubectl explain mpijob --api-version=kubeflow.org/v2beta1

   *Expected output*: ::

       GROUP:      kubeflow.org
       KIND:       MPIJob
       VERSION:    v2beta1

*Final output*:

Once the DeepSpeed deployment is complete (after approximately 30 minutes), checking the status of the pods with the ``kubectl get pod -n workloads`` command displays the following output. The launcher pod (``gaudi-llm-ds-ft-launcher-zfnls``) moves to a **Completed** status, and the worker pod (``gaudi-llm-ds-ft-worker-0``) is no longer listed because it has exited successfully. Together, these indicate a successful deployment: ::

    NAME                             READY   STATUS      RESTARTS   AGE
    gaudi-llm-ds-ft-launcher-zfnls   0/1     Completed   0          10h

.. note:: Sample workloads for pretraining, finetuning, and inference are available in the ``examples/ai_examples/intel/`` directory for your reference.
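
The final completion check described above can also be scripted. The following is a minimal sketch, not part of the original procedure: ``launcher_completed`` is a hypothetical helper name, and it assumes the default column layout of ``kubectl get pod`` (pod name in the first column, status in the third) and that the launcher pod name contains ``-launcher-``, as in the sample output.

```shell
#!/bin/sh
# Hypothetical helper (an illustration, not part of the original guide).
# Reads the output of "kubectl get pod -n workloads" on stdin and returns
# success only if a launcher pod is listed with STATUS "Completed".
launcher_completed() {
    # $1 is the pod name and $3 is the STATUS column in the default layout.
    awk '$1 ~ /-launcher-/ && $3 == "Completed" { found = 1 }
         END { exit !found }'
}

# Usage on a live cluster (assumes kubectl is configured for that cluster):
#   kubectl get pod -n workloads | launcher_completed && echo "DeepSpeed job finished"
```

Because the helper only parses text, it can be tried on saved ``kubectl`` output before being used against a running cluster.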