Now that we have torch installed in a persistent conda environment, we want to illustrate how to copy files from a local machine (i.e., the frontend) directly into our persistent volume. You can use this to transfer data from anywhere you can mount on your local frontend up into the K8S environment. We'll start with a simple python script.
First, we need to deploy a pod that has access to the persistent volume we want to upload something to. A simple example would be:
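The original manifest isn't shown here, but a minimal sketch of `persistentVolumeUpload.yml` would look like the following. The pod name `file-passthrough` and mount path `/kube/home` match the commands used later; the `claimName` (`dsmr-vol-01`) and image are assumptions taken from the GPU pod spec below, so adjust them to your own PVC and preferred base image:

```yaml
# persistentVolumeUpload.yml -- minimal sketch; adjust claimName and image to your setup
apiVersion: v1
kind: Pod
metadata:
  name: file-passthrough
spec:
  restartPolicy: Never
  volumes:
    - name: home-volume
      persistentVolumeClaim:
        claimName: dsmr-vol-01   # Ensure this is your correct PVC
  containers:
    - name: file-passthrough-container
      image: "ubuntu:22.04"      # Any small image works; we only need the filesystem
      command: ["/bin/bash", "-c", "sleep infinity"]  # Keep the pod alive for kubectl cp
      volumeMounts:
        - name: home-volume
          mountPath: /kube/home
```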
This example pod has access to our persistent volume and doesn't do much else. Go ahead and deploy it with `kubectl apply -f persistentVolumeUpload.yml`.
While that container is being created, go ahead and create a Python file named `gpu.py` by copying the code below. Create it on the frontend - i.e., cm, not within the pod. This file downloads an image dataset (CIFAR-10) into the directory `/kube/data`, and then runs a few epochs to classify it using a very small convolutional net:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
import time

devs = [torch.device("cuda:0" if torch.cuda.is_available() else "cpu"), "cpu"]

for device in devs:
    # Define the CNN model
    class Net(nn.Module):
        def __init__(self):
            super(Net, self).__init__()
            self.conv1 = nn.Conv2d(3, 6, 5)
            self.pool = nn.MaxPool2d(2, 2)
            self.conv2 = nn.Conv2d(6, 16, 5)
            self.fc1 = nn.Linear(16 * 5 * 5, 120)
            self.fc2 = nn.Linear(120, 84)
            self.fc3 = nn.Linear(84, 10)

        def forward(self, x):
            x = self.pool(F.relu(self.conv1(x)))
            x = self.pool(F.relu(self.conv2(x)))
            x = x.view(-1, 16 * 5 * 5)
            x = F.relu(self.fc1(x))
            x = F.relu(self.fc2(x))
            x = self.fc3(x)
            return x

    net = Net().to(device)

    # Define the loss function and optimizer
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)

    # Load the CIFAR-10 dataset
    transform = transforms.Compose(
        [transforms.ToTensor(),
         transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])
    trainset = torchvision.datasets.CIFAR10(root='/kube/data', train=True,
                                            download=True, transform=transform)
    trainloader = torch.utils.data.DataLoader(trainset, batch_size=4,
                                              shuffle=True, num_workers=2)

    # Train the model
    start_time = time.time()
    for epoch in range(5):  # Loop over the dataset multiple times
        epoch_start_time = time.time()
        running_loss = 0.0
        for i, data in enumerate(trainloader, 0):
            inputs, labels = data[0].to(device), data[1].to(device)
            optimizer.zero_grad()              # Zero the parameter gradients
            outputs = net(inputs)              # Forward
            loss = criterion(outputs, labels)
            loss.backward()                    # Backward
            optimizer.step()                   # Optimize
            running_loss += loss.item()
            if i % 2000 == 1999:               # Print every 2000 mini-batches
                print('[%d, %5d] loss: %.3f' % (epoch + 1, i + 1, running_loss / 2000))
                running_loss = 0.0
        epoch_end_time = time.time()
        print("Epoch %d completed in %s seconds" % (epoch + 1, round(epoch_end_time - epoch_start_time, 2)))
    end_time = time.time()
    total_time = end_time - start_time
    print("Trained on: " + str(device))
    print("Total training time: %s seconds" % round(total_time, 2))
    print("----------------------------------")
    print("----------------------------------")
```
Once you've created the python file, confirm that your new file-passthrough pod is up and running with kubectl get pods. If it's running, you can now copy the GPU script into your persistent volume. To do so, you can run:
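The copy command itself isn't shown above; based on the pod name and mount path used in the rest of this guide, it would look something like this (assuming `gpu.py` is in your current directory and the pod runs in your default namespace):

```shell
# Copy gpu.py from the frontend into the pod's mounted persistent volume
kubectl cp gpu.py file-passthrough:/kube/home/gpu.py
```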
Once you run that command, we can run a command against the file-passthrough pod to make sure it copied correctly:

```shell
kubectl exec -it file-passthrough -- ls -la /kube/home
```

If successful, the directory listing should include the new gpu.py file.
Running the Python file on GPUs
Once your file is copied over, you can shut down the file-passthrough pod (`kubectl delete pod file-passthrough`) and launch the GPU pod that will actually process the data. For now, we'll request only a single GPU in our pod, plus 2 CPUs to help with batching:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: torch-test
spec:
  restartPolicy: Never
  volumes:
    - name: home-volume
      persistentVolumeClaim:
        claimName: dsmr-vol-01   # Ensure this is your correct PVC
  containers:
    - name: pytorch-setup-container
      image: "nvidia/samples:vectoradd-cuda11.2.1"
      resources:
        requests:
          memory: "32Gi"
          nvidia.com/gpu: 1
          cpu: "2"
        limits:
          memory: "32Gi"
          nvidia.com/gpu: 1
          cpu: "2"
      volumeMounts:
        - name: home-volume
          mountPath: /kube/home/
      command:
        - /bin/bash
        - -c
        - |
          # Set the Miniconda path and initialize
          export MINICONDA_PATH="/kube/home/.envs/conda"
          export PATH="$MINICONDA_PATH/bin:$PATH"

          # Activate the environment
          source activate torchEnv

          # Make sure our GPUs are loading
          python -c "import torch; gpus = torch.cuda.device_count(); print(f'Available GPUs: {gpus}'); [print(f'GPU {gpu}: {torch.cuda.get_device_name(gpu)}') for gpu in range(gpus)]"

          # Run our GPU torch script
          python /kube/home/gpu.py
```
Go ahead and submit that with `kubectl apply -f 3_torchWithData.yml`, and then monitor the progress of torch using `kubectl logs torch-test`. Note that it may take around a minute for logs to start. If the script works correctly, you should see CIFAR-10 downloading, followed by classification across a few epochs of training.
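If the pod sits in `Pending` or the logs stall, two standard kubectl commands are useful for diagnosis (pod name as defined in the manifest above):

```shell
# Show scheduling events, e.g. whether the pod is waiting on a free GPU
kubectl describe pod torch-test

# Pods with restartPolicy: Never are not rescheduled; delete before re-applying
kubectl delete pod torch-test
```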