Setup PyTorch
PyTorch is a popular open-source deep learning framework, renowned for its dynamic computation graph that enhances flexibility and ease of use, making it a preferred choice for researchers and developers. With strong community support, PyTorch facilitates seamless experimentation and rapid prototyping in the field of machine learning.
Prerequisites
Ensure nerdctl is available on all cluster nodes.
If GPUs are present on the target nodes, install the NVIDIA CUDA (with containerd) or AMD ROCm drivers during provisioning. CPU-only nodes do not require any additional drivers.
Use local_repo.yml to create an offline PyTorch repository.
[Optional prerequisites]
Ensure the system has enough space.
Ensure the inventory file includes a kube_control_plane and a kube_node group listing all cluster nodes. Click here for a sample file.
Nerdctl does not support mounting directories as devices because this is not a feature of containerd (the runtime that nerdctl uses). Individual device files must be attached when running nerdctl.
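As a rough illustration of the inventory requirement above, the following Python sketch renders an INI-style Ansible inventory containing the two groups. The helper name render_inventory and the host addresses are hypothetical placeholders (documentation-range IPs), not values taken from Omnia; substitute the real cluster nodes.

```python
from typing import Iterable


def render_inventory(control_plane: Iterable[str], nodes: Iterable[str]) -> str:
    """Return INI-style Ansible inventory text with the two required groups."""
    lines = ["[kube_control_plane]"]
    lines.extend(control_plane)
    lines.append("")  # blank line between groups for readability
    lines.append("[kube_node]")
    lines.extend(nodes)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical addresses; replace with the real cluster nodes.
    print(render_inventory(["192.0.2.10"], ["192.0.2.11", "192.0.2.12"]), end="")
```

Saving the printed text to a file named inventory would give a file in the shape the playbook command expects, assuming the cluster needs no groups beyond the two named here.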
Deploying PyTorch
Change directories to the tools folder:
cd tools
Run the pytorch.yml playbook:
ansible-playbook pytorch.yml -i inventory
Note
During execution of the pytorch.yml playbook, nodes with AMD or NVIDIA GPUs and the matching drivers install and test the pytorch-AMD or pytorch-Nvidia container, respectively. Nodes with neither GPU type and its drivers install and test the pytorch-CPU container.
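The selection rule in the note can be sketched in Python as follows. This is only an illustration of the decision, not the playbook's actual implementation: detecting the vendor by looking for nvidia-smi or rocm-smi on PATH is an assumption, as is checking NVIDIA first when both are present. The function names are hypothetical.

```python
import shutil
from typing import Optional

# Container names as they appear in the note above.
CONTAINERS = {"amd": "pytorch-AMD", "nvidia": "pytorch-Nvidia"}
CPU_CONTAINER = "pytorch-CPU"


def detect_gpu_vendor() -> Optional[str]:
    """Guess the GPU vendor from driver tools on PATH (an assumption, not Omnia's check)."""
    if shutil.which("nvidia-smi"):
        return "nvidia"
    if shutil.which("rocm-smi"):
        return "amd"
    return None


def select_container(vendor: Optional[str]) -> str:
    """Map a detected vendor to a container name, falling back to the CPU container."""
    return CONTAINERS.get(vendor, CPU_CONTAINER)


if __name__ == "__main__":
    print(select_container(detect_gpu_vendor()))
```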
Accessing PyTorch (CPU)
Verify that the PyTorch image is present in container engine images:
nerdctl images
Use the container image per your needs:
nerdctl run -it --rm pytorch/pytorch:latest
For more information, click here.
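The image check in the first step above can also be scripted. The Python sketch below parses the text printed by nerdctl images, assuming the default table layout in which the first row is a header and the first whitespace-separated column of each later row is the repository. The helper names image_present and list_images are hypothetical.

```python
import shutil
import subprocess


def image_present(listing: str, repository: str) -> bool:
    """Return True if any row of an image listing names the given repository."""
    rows = listing.splitlines()[1:]  # skip the header row
    return any(row.split()[:1] == [repository] for row in rows)


def list_images() -> str:
    """Run `nerdctl images` and return its text output."""
    result = subprocess.run(
        ["nerdctl", "images"], capture_output=True, text=True, check=True
    )
    return result.stdout


if __name__ == "__main__":
    if shutil.which("nerdctl"):
        print(image_present(list_images(), "pytorch/pytorch"))
    else:
        print("nerdctl not found on PATH")
```

The same function can be reused for the GPU images by passing rocm/pytorch or nvcr.io/nvidia/pytorch as the repository.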
Accessing PyTorch (AMD GPU)
Verify that the PyTorch image is present in container engine images:
nerdctl images
Use the container image per your needs:
nerdctl run -it --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --device=/dev/kfd --device /dev/dri/card0 --device /dev/dri/card1 --device /dev/dri/card2 --device /dev/dri/renderD128 --device /dev/dri/renderD129 --group-add video --ipc=host --shm-size 8G rocm/pytorch:latest
For more information, click here.
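Because nerdctl cannot attach a directory as a device, the command above lists each file under /dev/dri individually, and the exact card and render nodes vary from host to host. A minimal Python sketch for generating those flags from whatever files are present is shown below; the helper name device_flags is hypothetical, and skipping subdirectories is an assumption about which entries are usable devices.

```python
import os
from typing import List


def device_flags(directory: str) -> List[str]:
    """Build one --device flag pair per non-directory entry under `directory`."""
    flags: List[str] = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isdir(path):
            continue  # skip subdirectories such as /dev/dri/by-path
        flags.extend(["--device", path])
    return flags


if __name__ == "__main__":
    dri = "/dev/dri"  # DRM device directory on Linux hosts
    if os.path.isdir(dri):
        print(" ".join(device_flags(dri)))
```

The printed flags can be pasted into the nerdctl run command in place of the hard-coded /dev/dri entries; /dev/kfd still has to be passed separately.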
Accessing PyTorch (NVIDIA GPU)
Verify that the PyTorch image is present in container engine images:
nerdctl images
Use the container image per your needs:
nerdctl run --gpus all -it --rm nvcr.io/nvidia/pytorch:23.12-py3
For more information, click here.
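Once a GPU container is running, a short Python check can confirm that PyTorch sees the accelerator. The sketch below is meant to be run inside the container; it relies on the torch.cuda API, which ROCm builds of PyTorch expose as well as CUDA builds. The function name accelerator_report is hypothetical, and the ImportError guard exists only so the script fails politely outside a PyTorch environment.

```python
def accelerator_report() -> dict:
    """Summarize the PyTorch build and any GPU it can see."""
    try:
        import torch  # present inside the PyTorch containers
    except ImportError:
        return {"torch": None, "gpu_available": False, "devices": []}
    available = torch.cuda.is_available()
    count = torch.cuda.device_count() if available else 0
    return {
        "torch": torch.__version__,
        "gpu_available": available,
        "devices": [torch.cuda.get_device_name(i) for i in range(count)],
    }


if __name__ == "__main__":
    print(accelerator_report())
```

An empty devices list inside a GPU container usually points to missing device flags or drivers rather than a problem with the image itself.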
Accessing PyTorch (Intel Gaudi accelerator)
Verify that the PyTorch image is present in container engine images:
nerdctl images
Use the container image per your needs:
nerdctl run -it --privileged -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.17.1/ubuntu22.04/habanalabs/pytorch-installer-2.4.0:latest
If you have any feedback about Omnia documentation, please reach out at omnia.readme@dell.com.