One Pod Torch with Data

Running Torch with Real Data in a Pod

Now that we have torch installed in a persistent conda environment, we'll illustrate how to copy files from a local machine (i.e., the frontend) directly into our persistent volume. You can use this approach to transfer data from anywhere you can mount on the frontend up into the K8S environment. We'll start with a simple python script.

First, we need to deploy a pod that has access to the persistent volume we want to upload something to. A simple example would be:

apiVersion: v1
kind: Pod
metadata:
  name: file-passthrough
spec:
  restartPolicy: Never
  activeDeadlineSeconds: 1800  # 30 minutes
  volumes:
    - name: home-volume
      persistentVolumeClaim:
        claimName: dsmr-vol-01
  containers:
    - name: conda-container
      image: "nvidia/samples:vectoradd-cuda11.2.1"
      volumeMounts:
        - name: home-volume
          mountPath: /kube/home/
      command: ["/bin/sh", "-c"]
      args:
        - |
          echo "Disk space usage for /kube/home volume:"
          df -h /kube/home
          echo "Sleeping indefinitely..."
          sleep infinity

This example pod mounts our persistent volume, prints its disk usage, and then sleeps so it stays alive for file transfers - beyond that, it doesn't do much. Go ahead and deploy it with kubectl apply -f persistentVolumeUpload.yml.
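As a quick sketch of that deploy step (this assumes you saved the manifest as persistentVolumeUpload.yml, per the command above):

kubectl apply -f persistentVolumeUpload.yml

# Watch the pod's status until it reads Running, then Ctrl+C
kubectl get pods -w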

While that container is being created, go ahead and create a python file named gpu.py by copying the code below. Create it on the frontend - i.e., cm, not within the pod. This file downloads an image dataset (CIFAR10) into the directory /kube/data, and then trains a very small convolutional net for a few epochs to classify it:

import torch
import torch.nn as nn
import torch.nn.functional as F 
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
import time

# Define the CNN model
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 5 * 5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

# Load the CIFAR-10 dataset once; download=True is a no-op if it's already in /kube/data
transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])
trainset = torchvision.datasets.CIFAR10(root='/kube/data', train=True, download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=4, shuffle=True, num_workers=2)

# Train on the GPU (if available) first, then on the CPU for comparison
devs = [torch.device("cuda:0" if torch.cuda.is_available() else "cpu"), "cpu"]

for device in devs:
    # Fresh model, loss function, and optimizer for each device
    net = Net().to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)

    # Train the model
    start_time = time.time()

    for epoch in range(5):  # Loop over the dataset multiple times
        epoch_start_time = time.time()
        running_loss = 0.0
        for i, data in enumerate(trainloader, 0):
            inputs, labels = data[0].to(device), data[1].to(device)
            optimizer.zero_grad()  # Zero the parameter gradients
            outputs = net(inputs)  # Forward
            loss = criterion(outputs, labels)
            loss.backward()  # Backward
            optimizer.step()  # Optimize
            running_loss += loss.item()
            if i % 2000 == 1999:  # Print every 2000 mini-batches
                print('[%d, %5d] loss: %.3f' % (epoch + 1, i + 1, running_loss / 2000))
                running_loss = 0.0
        epoch_end_time = time.time()
        print("Epoch %d completed in %s seconds" % (epoch+1, round(epoch_end_time - epoch_start_time, 2)))

    end_time = time.time()
    total_time = end_time - start_time
    print("Trained on: " + str(device))
    print("Total training time: %s seconds" % round(total_time, 2))
    print("----------------------------------")
    print("----------------------------------")

Once you've created the python file, confirm that your new file-passthrough pod is up and running with kubectl get pods. If it is, you can copy the GPU script into your persistent volume by running:

kubectl cp gpu.py dsmillerrunfol/file-passthrough:/kube/home/gpu.py

Note that dsmillerrunfol in the command above is the namespace - substitute your own. Once the copy completes, we can run a command against the file-passthrough pod to make sure the file arrived:

kubectl exec -it file-passthrough -- ls -la /kube/home
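The same kubectl cp mechanism works for entire directories, which is how you'd move a real dataset rather than a single script. A hypothetical example (the ./myDataset directory is made up - substitute your own path and namespace):

kubectl cp ./myDataset dsmillerrunfol/file-passthrough:/kube/home/myDataset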

Running the Python File on GPUs

Once your file is copied over, you can shut down the file-passthrough pod (kubectl delete pod file-passthrough) and launch the GPU pod that will actually process the data. For now, we'll request only a single GPU in our pod, plus 2 CPUs to help with batching:

apiVersion: v1
kind: Pod
metadata:
  name: torch-test
spec:
  restartPolicy: Never
  volumes:
    - name: home-volume
      persistentVolumeClaim:
        claimName: dsmr-vol-01  # Ensure this is your correct PVC
  containers:
    - name: pytorch-setup-container
      image: "nvidia/samples:vectoradd-cuda11.2.1"
      resources:
        requests:
          memory: "32Gi"
          nvidia.com/gpu: 1
          cpu: "2"
        limits:
          memory: "32Gi"
          nvidia.com/gpu: 1
          cpu: "2"
      volumeMounts:
        - name: home-volume
          mountPath: /kube/home/
      command:
        - /bin/bash
        - -c
        - |
          # Set the Miniconda path and initialize
          export MINICONDA_PATH="/kube/home/.envs/conda"
          export PATH="$MINICONDA_PATH/bin:$PATH"

          # Activate the environment
          source activate torchEnv

          # Make sure our GPUs are visible to torch
          python -c "import torch; gpus = torch.cuda.device_count(); print(f'Available GPUs: {gpus}'); [print(f'GPU {gpu}: {torch.cuda.get_device_name(gpu)}') for gpu in range(gpus)]"

          # Run our gpu torch script
          python /kube/home/gpu.py
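Deploy this pod the same way as before and follow its output with kubectl logs (the manifest name torchTest.yml here is an assumption - use whatever filename you saved it under; torch-test is the pod name from the metadata above):

kubectl apply -f torchTest.yml

# Stream the pod's stdout - the GPU check and training progress
kubectl logs -f torch-test

The script trains first on the GPU and then on the CPU, so the per-epoch and total times it prints give you a direct comparison between the two.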
