
Add CPU support for PyTorch DDP training#1040

Open
aagallo wants to merge 2 commits into awslabs:main from aagallo:torchvision_cpu

Conversation


@aagallo aagallo commented Mar 27, 2026

Purpose

Enables the distributed training examples to run on CPU instances by adding CPU-specific installation and containerization support for PyTorch DDP training on Amazon Parallel Computing Service (PCS).

Changes

  • Created 0.create-venv-cpu.sh - CPU-specific virtual environment setup script that installs PyTorch and Torchvision with CPU-only support
  • Created 2.create-enroot-image-cpu.sh - CPU-specific Enroot container image creation script
  • Created Dockerfile.cpu - CPU-specific Dockerfile based on Python 3.12-slim that includes:
    • System dependencies for torchvision image processing (libgl1, libglx-mesa0, libglib2.0-0)
    • PyTorch 2.10.0+cpu and Torchvision 0.25.0+cpu from the official PyTorch CPU repository
    • MLflow 2.13.2 and sagemaker-mlflow 0.1.0 for experiment tracking
    • Pre-copied MNIST dataset to prevent download race conditions during parallel training
    • Optimized image size with --no-cache-dir and apt cleanup
  • Added slurm/data/ directory - Contains pre-downloaded MNIST dataset to avoid concurrent download issues across nodes
  • Modified installation process to support CPU-only deployments without GPU dependencies

Key differences between GPU and CPU versions:

  • Dockerfile.cpu uses python:3.12-slim base image vs. pytorch/pytorch:latest in the GPU version
  • Dockerfile.cpu explicitly installs CPU-specific PyTorch/Torchvision packages from the CPU wheel repository
  • Dockerfile.cpu includes additional system libraries required for torchvision image operations
  • Dockerfile.cpu pre-copies training data and removes compressed files to prevent race conditions
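
The diff itself isn't reproduced on this page, but from the description above the core of Dockerfile.cpu plausibly looks like this. This is a sketch, not the actual file: the COPY paths and the header placement are assumptions, while the pinned versions, system libraries, and the official PyTorch CPU wheel index come straight from the change list.

```dockerfile
# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: MIT-0
FROM python:3.12-slim

# System libraries torchvision needs for image decoding
RUN apt-get update \
    && apt-get install -y --no-install-recommends libgl1 libglx-mesa0 libglib2.0-0 \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /workspace

# CPU-only wheels come from the dedicated PyTorch index;
# mlflow lives on PyPI, so it is installed in a separate step
RUN pip install --no-cache-dir \
        --index-url https://download.pytorch.org/whl/cpu \
        torch==2.10.0 torchvision==0.25.0 \
    && pip install --no-cache-dir mlflow==2.13.2 sagemaker-mlflow==0.1.0

# Pre-copied MNIST dataset avoids download races across ranks
COPY slurm/data/ /workspace/data/
COPY ddp.py /workspace/
```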

Test Plan

Environment:

  • AWS Service: Amazon Parallel Computing Service (PCS)
  • Instance type: c6i.xlarge
  • Number of nodes: 2

Test commands:

# Create CPU-specific virtual environment
./0.create-venv-cpu.sh

# Submit training job to Slurm (venv-based)
sbatch 1.venv-train.sbatch

# Create CPU-specific Enroot container image
./2.create-enroot-image-cpu.sh

# Submit training job to Slurm (container-based)
sbatch 3.container-train.sbatch

Test Results

Successfully tested on Amazon PCS with 2 nodes using c6i.xlarge instances. Both virtual environment and containerized approaches were validated:

  • CPU-specific virtual environment created without GPU dependencies
  • CPU-specific container image built successfully with all required system libraries
  • Training jobs executed successfully in both venv and container modes
  • Pre-copied MNIST dataset prevented download race conditions across nodes

Directory Structure

3.test_cases/
└── pytorch/
    └── ddp/
        ├── kubernetes/
        ├── slurm/
        │   ├── 0.create-venv-cpu.sh          # New: CPU venv setup
        │   ├── 0.create-venv.sh               # Original GPU venv setup
        │   ├── 1.venv-train.sbatch
        │   ├── 2.create-enroot-image-cpu.sh  # New: CPU container image
        │   ├── 2.create-enroot-image.sh       # Original GPU container image
        │   ├── 3.container-train.sbatch
        │   ├── data/                          # New: Pre-downloaded MNIST dataset
        │   └── README.md
        ├── .gitignore
        ├── ddp.py
        ├── Dockerfile                         # Original GPU Dockerfile
        ├── Dockerfile.cpu                     # New: CPU Dockerfile
        └── README.md

Modified/Added files:

  • Added: 3.test_cases/pytorch/ddp/slurm/0.create-venv-cpu.sh
  • Added: 3.test_cases/pytorch/ddp/slurm/2.create-enroot-image-cpu.sh
  • Added: 3.test_cases/pytorch/ddp/slurm/data/ (MNIST dataset directory)
  • Added: 3.test_cases/pytorch/ddp/Dockerfile.cpu
  • Updated: 3.test_cases/pytorch/ddp/README.md (to document CPU usage instructions)
  • Updated: 3.test_cases/pytorch/ddp/slurm/README.md (to document CPU-specific scripts)

Checklist

  • I have read the contributing guidelines.
  • I am working against the latest main branch.
  • I have searched existing open and recently merged PRs to confirm this is not a duplicate.
  • The contribution is self-contained with documentation and scripts.
  • External dependencies are pinned to a specific version or tag (no latest).
  • A README is included or updated with prerequisites, instructions, and known issues.
  • New test cases follow the expected directory structure.

aagallo added 2 commits March 26, 2026 11:37
Signed-off-by: aagallo <aagallo@amzon.com>

Collaborator

@KeitaW KeitaW left a comment

Review 1/2 — Existing CPU Support & Approach

I appreciate the effort here, but I need to understand the value add of this PR before proceeding. We have a workshop on April 9th that depends on the CPU DDP test case working reliably, so I want to be careful about merging changes that add maintenance surface area without a clear need. The existing ddp.py training script already supports both CPU and GPU — it auto-detects the available hardware and selects the appropriate backend and device. Before adding duplicate scripts, I'd like to see the cases that the current implementation actually fails on CPU instances.

The existing code already supports CPU

File: 3.test_cases/pytorch/ddp/ddp.py (lines 37-42, 63)

The training script already handles CPU transparently:

def ddp_setup():
    if torch.cuda.is_available():
        init_process_group(backend="nccl")
    else:
        init_process_group(backend="gloo")
self.device = torch.device(f"cuda:{os.environ['LOCAL_RANK']}" if torch.cuda.is_available() else "cpu")

And DDP is initialized with device_ids=None when CUDA is unavailable:

self.model = DDP(self.model, device_ids=[self.device.index] if torch.cuda.is_available() else None)

This means ddp.py will run on CPU instances out of the box with torchrun. The only thing that changes for CPU is the PyTorch installation step — and that doesn't require three new files. The existing 0.create-venv.sh installs torch==2.10.0 which pip will resolve to a CPU-compatible wheel on a machine without CUDA.

What I'd like to see: Could you test the existing scripts on your PCS CPU instance setup first and share what specifically fails? If there's a real gap (e.g., pip pulls CUDA dependencies that bloat the venv or the Docker build fails), let's fix it by parameterizing the existing scripts rather than duplicating them.


Bugs worth fixing in the existing code

While reviewing this PR, I noticed a couple of issues in the existing files that are worth fixing separately — and you clearly ran into these:

  1. Dockerfile uses pytorch/pytorch:latest — this violates the repo convention of pinned version tags. You correctly pinned torch==2.10.0 in the CPU Dockerfile. It would be great if you could submit a smaller PR that pins the base image version in the existing GPU Dockerfile (e.g., pytorch/pytorch:2.10.0-cuda12.8-cudnn9-runtime) and adds --no-cache-dir to the pip install.

  2. MNIST download race condition is a real issue — torchvision's datasets.MNIST(download=True) has no file locking or atomic writes. When multiple torchrun ranks call it simultaneously, they all pass the _check_exists() check and write to the same files concurrently, causing corrupted downloads or extraction failures. This affects both CPU and GPU. The standard fix is the rank-0-with-barrier pattern in ddp.py:

    rank = int(os.environ.get("RANK", 0))
    if rank == 0:
        datasets.MNIST(root='./data', train=True, download=True)
    torch.distributed.barrier()
    # All ranks now safely load the already-downloaded data
    train_set = datasets.MNIST(root='./data', train=True, download=False, transform=transform)

    This is a small, targeted change to load_train_objs() and would be a welcome standalone PR.

  3. 1.venv-train.sbatch is missing the copyright header — the GPU venv sbatch script doesn't have the license header that 3.container-train.sbatch does.


Things That Look Great

  • The PR description is excellent — thorough test plan, clear directory structure, and good explanation of GPU vs CPU differences.
  • Version pins on PyTorch (2.10.0+cpu), torchvision (0.25.0+cpu), and mlflow (2.13.2) are specific and reproducible.
  • The awareness of MNIST download race conditions in distributed settings shows real hands-on experience.
  • Using python:3.12-slim as the base image is a smart choice for CPU-only workloads.

Collaborator

@KeitaW KeitaW left a comment

Review 2/2 — Additional Concerns if Revisited

These would need to be addressed if the PR is revisited after testing the existing scripts on CPU.

Comment on lines +1 to +3
FROM python:3.12-slim

# --- ADD THIS ---
Collaborator

Missing copyright header

All files in this repo require the license header. Also, the # --- ADD THIS --- comment reads like a tutorial instruction rather than a production comment — I'd suggest replacing it with a descriptive comment.

Suggested change
- FROM python:3.12-slim
- # --- ADD THIS ---
+ # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
+ # SPDX-License-Identifier: MIT-0
+ FROM python:3.12-slim
+ # torchvision requires these system libraries for image decoding


WORKDIR /workspace

# --- CHANGE THESE ---
Collaborator

Tutorial-style comment

This comment doesn't add value in the committed file — consider replacing with something descriptive.

Suggested change
- # --- CHANGE THESE ---
+ # Copy training script and pre-downloaded dataset

Comment on lines +1 to +2
#!/bin/bash
set -ex
Collaborator

Missing copyright header & shebang inconsistency

This script is missing the required MIT-0 license header. Also, the existing GPU scripts use #!/usr/bin/env bash — I'd suggest matching that convention.

Suggested change
- #!/bin/bash
- set -ex
+ #!/usr/bin/env bash
+ set -ex
+ # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
+ # SPDX-License-Identifier: MIT-0


# 1. Pre-download MNIST to a local folder named 'data'
# This ensures we have the files ready to COPY into the Docker image
python3 -c "from torchvision import datasets; datasets.MNIST(root='./data', train=True, download=True)"
Collaborator

Host-side torchvision dependency

This line requires torchvision to be installed on the build host (not inside the container). On a fresh Slurm head node, torchvision is unlikely to be available. The existing GPU version (2.create-enroot-image.sh) doesn't need this. I'd suggest either downloading data inside the Dockerfile (a RUN step), or using curl/wget to fetch the raw MNIST files without requiring torchvision on the host.
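
As a sketch of the Dockerfile route (the root path here is illustrative): torchvision is already installed inside the image by this point in the build, so the build host needs nothing.

```dockerfile
# Download MNIST at build time, inside the container where torchvision exists;
# the build host then needs neither torchvision nor any Python packages
RUN python3 -c "from torchvision import datasets; \
    datasets.MNIST(root='/workspace/data', train=True, download=True)"
```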

Collaborator

@KeitaW KeitaW left a comment

Left a few comments.

Author

aagallo commented Mar 27, 2026

Thanks @KeitaW for the thorough feedback. Let me address the points and propose a path forward:

Core Issue: CPU-Specific PyTorch Installation

You're absolutely right that ddp.py supports both CPU and GPU out of the box. The issue isn't with the training script—it's with the PyTorch installation.

When I tested the existing scripts on c6i instances (PCS), the installation succeeded but installed GPU versions of PyTorch and torchvision by default. These GPU versions fail at runtime on instances without GPU hardware.

The +cpu suffix is critical: torch==2.10.0+cpu explicitly tells pip to use the CPU-only wheel from the PyTorch CPU repository, ensuring the libraries work correctly on CPU-only instances.

Proposed Approach

I agree that duplicating scripts adds maintenance overhead. However, I'd like to propose keeping the CPU-specific variants for the following reasons:

  • Backward compatibility: There are existing workshops and assets already created using the current structure. Parameterizing the existing scripts would break compatibility with these materials.
  • Clarity: Separate CPU scripts make it immediately clear which version to use for CPU vs GPU environments.

That said, I'm open to parameterization if you prefer—we could add a --cpu flag to the existing scripts. Let me know your preference.
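
For what it's worth, the parameterized branch could be as small as a helper that switches the pip index. This is a sketch, not the repo's actual script: torch_pip_args and the flag handling are illustrative names, while the index URL is PyTorch's official CPU wheel repository.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Emit the extra pip arguments for the requested target ("cpu" or "gpu").
torch_pip_args() {
    if [ "${1:-gpu}" = "cpu" ]; then
        # CPU-only wheels live on a dedicated index
        echo "--index-url https://download.pytorch.org/whl/cpu"
    fi
}

# The install line in 0.create-venv.sh would then become:
#   pip install --no-cache-dir $(torch_pip_args "$MODE") torch==2.10.0 torchvision==0.25.0
torch_pip_args cpu   # prints: --index-url https://download.pytorch.org/whl/cpu
torch_pip_args gpu   # prints nothing
```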

Bugs to Fix

For the issues you identified:

  • MNIST download race condition in ddp.py using the rank-0-with-barrier pattern you suggested
  • Pin base image version in Dockerfile (e.g., pytorch/pytorch:2.10.0-cuda12.8-cudnn9-runtime) and add --no-cache-dir

Are these strictly necessary to address in this PR, or would you prefer I handle them separately?

Fixes for Current PR (if we proceed)

If we move forward with the CPU-specific files, I'll address:
  • Add copyright headers to all new files (Dockerfile.cpu, 2.create-enroot-image-cpu.sh)
  • Replace tutorial-style comments with descriptive ones
  • Use #!/usr/bin/env bash for consistency
  • Address the host-side torchvision dependency by downloading MNIST inside the Dockerfile (RUN step) instead of requiring torchvision on the build host

Let me know how you'd like to proceed—I'm happy to either refactor this PR with the fixes above or pivot to a parameterized approach if that's preferred.

Collaborator

KeitaW commented Mar 28, 2026

Test Results: Existing DDP Code on CPU Instances

I set up a HyperPod Slurm cluster with CPU-only instances (ml.c5.2xlarge x2 compute nodes) in us-east-1 and ran the existing, unmodified test case scripts. Here are the results:

Environment

  • Cluster: HyperPod Slurm (cpu-ddp-test)
  • Compute nodes: 2x ml.c5.2xlarge (CPU-only, no GPU)
  • PyTorch: 2.10.0 (GPU wheel, installed via existing 0.create-venv.sh)
  • Shared storage: FSx Lustre

Test 1: 0.create-venv.sh (existing GPU venv script)

Result: PASS — pip install torch==2.10.0 torchvision==0.25.0 completed successfully on CPU instances. It pulls ~2GB of unnecessary NVIDIA dependencies (cudnn, nccl, etc.), but they install without error.

Test 2: sbatch 1.venv-train.sbatch (first run, no pre-downloaded data)

Result: FAIL — but not because of CPU/GPU. The failure was the MNIST download race condition:

[rank1]: RuntimeError: File not found or corrupted.

Both ranks called datasets.MNIST(download=True) simultaneously on the shared FSx filesystem. Rank 1 tried to verify an MD5 checksum while rank 0 was still writing the file, causing corruption. This is a pre-existing bug that affects both CPU and GPU — it has nothing to do with the PyTorch wheel variant.

Test 3: sbatch 1.venv-train.sbatch (after pre-downloading MNIST on head node)

Result: PASS — training completed successfully across 2 CPU nodes:

Using GLOO backend for CPU training
Using GLOO backend for CPU training
[Gloo] Rank 0 is connected to 1 peer ranks.
[Gloo] Rank 1 is connected to 1 peer ranks.
[RANK 0] Epoch 0 | Batchsize: 32 | Steps: 938 | Loss: 0.198
[RANK 1] Epoch 0 | Batchsize: 32 | Steps: 938 | Loss: 0.199
...
[RANK 0] Epoch 9 | Batchsize: 32 | Steps: 938 | Loss: 0.023
[RANK 1] Epoch 9 | Batchsize: 32 | Steps: 938 | Loss: 0.024
worker group successfully finished. Waiting 300 seconds for other agents to finish

The existing ddp.py auto-detected CPU, selected the GLOO backend, and trained to convergence — all with the standard GPU PyTorch wheel.

Conclusion

The +cpu wheel suffix saves disk space (~2GB less NVIDIA deps) but is not required for correctness. The GPU wheels install and run fine on CPU-only instances. The only real issue is the MNIST download race condition, which is a pre-existing bug in ddp.py that affects both CPU and GPU deployments on shared filesystems.

Suggested follow-up PRs

I'd love to see you contribute these as separate, focused PRs:

  1. Fix the MNIST download race condition in ddp.py — add the rank-0-with-barrier pattern to load_train_objs():

    rank = int(os.environ.get("RANK", 0))
    if rank == 0:
        datasets.MNIST(root='./data', train=True, download=True)
    torch.distributed.barrier()
    train_set = datasets.MNIST(root='./data', train=True, download=False, transform=transform)
  2. Pin the base image in the existing Dockerfile — replace pytorch/pytorch:latest with a pinned version tag.

These would be valuable, targeted improvements. Thanks for the effort on this PR — your investigation of the race condition was spot-on.
