Add CPU support for PyTorch DDP training#1040
Conversation
Signed-off-by: aagallo <aagallo@amzon.com>
Review 1/2 — Existing CPU Support & Approach
I appreciate the effort here, but I need to understand the value add of this PR before proceeding. We have a workshop on April 9th that depends on the CPU DDP test case working reliably, so I want to be careful about merging changes that add maintenance surface area without a clear need. The existing ddp.py training script already supports both CPU and GPU — it auto-detects the available hardware and selects the appropriate backend and device. Before adding duplicate scripts, I'd like to see the cases that the current implementation actually fails on CPU instances.
The existing code already supports CPU
File: 3.test_cases/pytorch/ddp/ddp.py (lines 37-42, 63)
The training script already handles CPU transparently:
```python
def ddp_setup():
    if torch.cuda.is_available():
        init_process_group(backend="nccl")
    else:
        init_process_group(backend="gloo")
```

```python
self.device = torch.device(f"cuda:{os.environ['LOCAL_RANK']}" if torch.cuda.is_available() else "cpu")
```

And DDP is initialized with `device_ids=None` when CUDA is unavailable:

```python
self.model = DDP(self.model, device_ids=[self.device.index] if torch.cuda.is_available() else None)
```

This means `ddp.py` will run on CPU instances out of the box with `torchrun`. The only thing that changes for CPU is the PyTorch installation step, and that doesn't require three new files. The existing `0.create-venv.sh` installs `torch==2.10.0`, which pip will resolve to a CPU-compatible wheel on a machine without CUDA.
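The selection logic above boils down to a small pure function. A minimal sketch (the function name is hypothetical, not from `ddp.py`; no torch required) that mirrors the branch:

```python
# Sketch of the backend/device auto-detection quoted above, factored as a
# pure function so the branching can be checked without any GPU hardware.
# select_backend_and_device is an illustrative name, not repo code.
def select_backend_and_device(cuda_available: bool, local_rank: int):
    if cuda_available:
        return "nccl", f"cuda:{local_rank}"
    return "gloo", "cpu"

print(select_backend_and_device(False, 0))  # ('gloo', 'cpu')
print(select_backend_and_device(True, 3))   # ('nccl', 'cuda:3')
```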
What I'd like to see: Could you test the existing scripts on your PCS CPU instance setup first and share what specifically fails? If there's a real gap (e.g., pip pulls CUDA dependencies that bloat the venv or the Docker build fails), let's fix it by parameterizing the existing scripts rather than duplicating them.
Bugs worth fixing in the existing code
While reviewing this PR, I noticed a few issues in the existing files that are worth fixing separately, and you clearly ran into these:
- `Dockerfile` uses `pytorch/pytorch:latest`, which violates the repo convention of pinned version tags. You correctly pinned `torch==2.10.0` in the CPU Dockerfile. It would be great if you could submit a smaller PR that pins the base image version in the existing GPU Dockerfile (e.g., `pytorch/pytorch:2.10.0-cuda12.8-cudnn9-runtime`) and adds `--no-cache-dir` to the pip install.
- MNIST download race condition is a real issue: torchvision's `datasets.MNIST(download=True)` has no file locking or atomic writes. When multiple `torchrun` ranks call it simultaneously, they all pass the `_check_exists()` check and write to the same files concurrently, causing corrupted downloads or extraction failures. This affects both CPU and GPU. The standard fix is the rank-0-with-barrier pattern in `ddp.py`:

  ```python
  rank = int(os.environ.get("RANK", 0))
  if rank == 0:
      datasets.MNIST(root='./data', train=True, download=True)
  torch.distributed.barrier()
  # All ranks now safely load the already-downloaded data
  train_set = datasets.MNIST(root='./data', train=True, download=False, transform=transform)
  ```

  This is a one-line-level change to `load_train_objs()` and would be a welcome standalone PR.
- `1.venv-train.sbatch` is missing the copyright header: the GPU venv sbatch script doesn't have the license header that `3.container-train.sbatch` does.
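To see why the barrier makes that pattern safe, here is a stdlib-only illustration (threads stand in for `torchrun` ranks; `simulate` and `worker` are illustrative names, not repo code): every rank reaches the barrier, but only rank 0 performs the "download".

```python
# Stdlib-only illustration of the rank-0-download-then-barrier pattern.
import threading

def simulate(world_size: int = 4) -> int:
    downloads = []
    barrier = threading.Barrier(world_size)

    def worker(rank: int):
        if rank == 0:
            downloads.append("mnist")   # only rank 0 performs the download
        barrier.wait()                  # every other rank blocks until it's done
        # ... all ranks can now read the dataset safely ...

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(world_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(downloads)

print(simulate())  # 1: exactly one "download" happened
```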
Things That Look Great
- The PR description is excellent — thorough test plan, clear directory structure, and good explanation of GPU vs CPU differences.
- Version pins on PyTorch (`2.10.0+cpu`), torchvision (`0.25.0+cpu`), and mlflow (`2.13.2`) are specific and reproducible.
- The awareness of MNIST download race conditions in distributed settings shows real hands-on experience.
- Using `python:3.12-slim` as the base image is a smart choice for CPU-only workloads.
KeitaW left a comment
Review 2/2 — Additional Concerns if Revisited
These would need to be addressed if the PR is revisited after testing the existing scripts on CPU.
```dockerfile
FROM python:3.12-slim
# --- ADD THIS ---
```
Missing copyright header
All files in this repo require the license header. Also, the `# --- ADD THIS ---` comment reads like a tutorial instruction rather than a production comment; I'd suggest replacing it with a descriptive comment.
```suggestion
# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: MIT-0
FROM python:3.12-slim
# torchvision requires these system libraries for image decoding
```
```dockerfile
WORKDIR /workspace
# --- CHANGE THESE ---
```
Tutorial-style comment
This comment doesn't add value in the committed file — consider replacing with something descriptive.
```suggestion
# Copy training script and pre-downloaded dataset
```
```bash
#!/bin/bash
set -ex
```
Missing copyright header & shebang inconsistency
This script is missing the required MIT-0 license header. Also, the existing GPU scripts use `#!/usr/bin/env bash`; I'd suggest matching that convention.
```suggestion
#!/usr/bin/env bash
set -ex
# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: MIT-0
```
```bash
# 1. Pre-download MNIST to a local folder named 'data'
# This ensures we have the files ready to COPY into the Docker image
python3 -c "from torchvision import datasets; datasets.MNIST(root='./data', train=True, download=True)"
```
Host-side torchvision dependency
This line requires torchvision to be installed on the build host (not inside the container). On a fresh Slurm head node, torchvision is unlikely to be available. The existing GPU version (`2.create-enroot-image.sh`) doesn't need this. I'd suggest either downloading the data inside the Dockerfile (a `RUN` step), or using curl/wget to fetch the raw MNIST files without requiring torchvision on the host.
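A sketch of the in-Dockerfile alternative. The S3 mirror URL is the one torchvision itself downloads from, but treat it and the target directory layout as assumptions to verify; note also that `python:3.12-slim` does not ship curl, so an `apt-get install curl` step would be needed first.

```dockerfile
# Sketch: fetch raw MNIST during the image build so the host needs no torchvision.
# Mirror URL and data layout are assumptions, not taken from this PR.
RUN mkdir -p /workspace/data/MNIST/raw && cd /workspace/data/MNIST/raw && \
    for f in train-images-idx3-ubyte.gz train-labels-idx1-ubyte.gz \
             t10k-images-idx3-ubyte.gz t10k-labels-idx1-ubyte.gz; do \
        curl -fsSL -O "https://ossci-datasets.s3.amazonaws.com/mnist/$f"; \
    done
```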
Thanks @KeitaW for the thorough feedback. Let me address the points and propose a path forward:

Core Issue: CPU-Specific PyTorch Installation

You're absolutely right that `ddp.py` supports both CPU and GPU out of the box. The issue isn't with the training script; it's with the PyTorch installation. When I tested the existing scripts on c6i instances (PCS), the installation succeeded but installed GPU versions of PyTorch and torchvision by default. These GPU versions fail at runtime on instances without GPU hardware. The `+cpu` suffix is critical:

Proposed Approach

I agree that duplicating scripts adds maintenance overhead. However, I'd like to propose keeping the CPU-specific variants for the following reasons:
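For context, installing the CPU wheels typically means pointing pip at PyTorch's CPU wheel index. A hedged sketch using the versions pinned in this PR (verify the exact command against the PR's scripts):

```bash
pip install torch==2.10.0+cpu torchvision==0.25.0+cpu \
    --index-url https://download.pytorch.org/whl/cpu
```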
That said, I'm open to parameterization if you prefer: we could add a `--cpu` flag to the existing scripts. Let me know your preference.

Bugs to Fix

For the issues you identified:
Are these strictly necessary to address in this PR, or would you prefer I handle them separately?

Fixes for Current PR (if we proceed)
Let me know how you'd like to proceed: I'm happy to either refactor this PR with the fixes above or pivot to a parameterized approach if that's preferred.
Test Results: Existing DDP Code on CPU Instances

I set up a HyperPod Slurm cluster with CPU-only instances (

Environment
Test 1:
Purpose
Enables the distributed training examples to run on CPU instances by adding CPU-specific installation and containerization support for PyTorch DDP training on Amazon Parallel Computing Service (PCS).
Changes
Key differences between GPU and CPU versions:
Test Plan
Environment:
Test commands:
Test Results
Successfully tested on Amazon PCS with 2 nodes using c6i.xlarge instances. Both virtual environment and containerized approaches were validated:
Directory Structure
Modified/Added files:
Checklist
`main` branch.
`latest`).