
Add CPU support for PyTorch DDP training#1040

Open
aagallo wants to merge 2 commits into awslabs:main from aagallo:torchvision_cpu

Conversation


@aagallo aagallo commented Mar 27, 2026

Purpose

Enables the distributed training examples to run on CPU instances by adding CPU-specific installation and containerization support for PyTorch DDP training on Amazon Parallel Computing Service (PCS).

Changes

  • Created 0.create-venv-cpu.sh - CPU-specific virtual environment setup script that installs PyTorch and Torchvision with CPU-only support
  • Created 2.create-enroot-image-cpu.sh - CPU-specific Enroot container image creation script
  • Created Dockerfile.cpu - CPU-specific Dockerfile based on Python 3.12-slim that includes:
    • System dependencies for torchvision image processing (libgl1, libglx-mesa0, libglib2.0-0)
    • PyTorch 2.10.0+cpu and Torchvision 0.25.0+cpu from the official PyTorch CPU repository
    • MLflow 2.13.2 and sagemaker-mlflow 0.1.0 for experiment tracking
    • Pre-copied MNIST dataset to prevent download race conditions during parallel training
    • Optimized image size with --no-cache-dir and apt cleanup
  • Added slurm/data/ directory - Contains pre-downloaded MNIST dataset to avoid concurrent download issues across nodes
  • Modified installation process to support CPU-only deployments without GPU dependencies

Key differences between GPU and CPU versions:

  • Dockerfile.cpu uses python:3.12-slim base image vs. pytorch/pytorch:latest in the GPU version
  • Dockerfile.cpu explicitly installs CPU-specific PyTorch/Torchvision packages from the CPU wheel repository
  • Dockerfile.cpu includes additional system libraries required for torchvision image operations
  • Dockerfile.cpu pre-copies training data and removes compressed files to prevent race conditions
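
The diff itself isn't reproduced on this page, but from the description above the core of Dockerfile.cpu plausibly looks like this. This is a sketch, not the actual file: the COPY paths and the header placement are assumptions, while the pinned versions, system libraries, and the official PyTorch CPU wheel index come straight from the change list.

```dockerfile
# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: MIT-0
FROM python:3.12-slim

# System libraries torchvision needs for image decoding
RUN apt-get update \
    && apt-get install -y --no-install-recommends libgl1 libglx-mesa0 libglib2.0-0 \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /workspace

# CPU-only wheels come from the dedicated PyTorch index;
# mlflow lives on PyPI, so it is installed in a separate step
RUN pip install --no-cache-dir \
        --index-url https://download.pytorch.org/whl/cpu \
        torch==2.10.0 torchvision==0.25.0 \
    && pip install --no-cache-dir mlflow==2.13.2 sagemaker-mlflow==0.1.0

# Pre-copied MNIST dataset avoids download races across ranks
COPY slurm/data/ /workspace/data/
COPY ddp.py /workspace/
```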

Test Plan

Environment:

  • AWS Service: Amazon Parallel Computing Service (PCS)
  • Instance type: c6i.xlarge
  • Number of nodes: 2

Test commands:

# Create CPU-specific virtual environment
./0.create-venv-cpu.sh

# Submit training job to Slurm (venv-based)
sbatch 1.venv-train.sbatch

# Create CPU-specific Enroot container image
./2.create-enroot-image-cpu.sh

# Submit training job to Slurm (container-based)
sbatch 3.container-train.sbatch

Test Results

Successfully tested on Amazon PCS with 2 nodes using c6i.xlarge instances. Both virtual environment and containerized approaches were validated:

  • CPU-specific virtual environment created without GPU dependencies
  • CPU-specific container image built successfully with all required system libraries
  • Training jobs executed successfully in both venv and container modes
  • Pre-copied MNIST dataset prevented download race conditions across nodes

Directory Structure

3.test_cases/
└── pytorch/
    └── ddp/
        ├── kubernetes/
        ├── slurm/
        │   ├── 0.create-venv-cpu.sh          # New: CPU venv setup
        │   ├── 0.create-venv.sh               # Original GPU venv setup
        │   ├── 1.venv-train.sbatch
        │   ├── 2.create-enroot-image-cpu.sh  # New: CPU container image
        │   ├── 2.create-enroot-image.sh       # Original GPU container image
        │   ├── 3.container-train.sbatch
        │   ├── data/                          # New: Pre-downloaded MNIST dataset
        │   └── README.md
        ├── .gitignore
        ├── ddp.py
        ├── Dockerfile                         # Original GPU Dockerfile
        ├── Dockerfile.cpu                     # New: CPU Dockerfile
        └── README.md

Modified/Added files:

  • Added: 3.test_cases/pytorch/ddp/slurm/0.create-venv-cpu.sh
  • Added: 3.test_cases/pytorch/ddp/slurm/2.create-enroot-image-cpu.sh
  • Added: 3.test_cases/pytorch/ddp/slurm/data/ (MNIST dataset directory)
  • Added: 3.test_cases/pytorch/ddp/Dockerfile.cpu
  • Updated: 3.test_cases/pytorch/ddp/README.md (to document CPU usage instructions)
  • Updated: 3.test_cases/pytorch/ddp/slurm/README.md (to document CPU-specific scripts)

Checklist

  • I have read the contributing guidelines.
  • I am working against the latest main branch.
  • I have searched existing open and recently merged PRs to confirm this is not a duplicate.
  • The contribution is self-contained with documentation and scripts.
  • External dependencies are pinned to a specific version or tag (no latest).
  • A README is included or updated with prerequisites, instructions, and known issues.
  • New test cases follow the expected directory structure.

aagallo added 2 commits March 26, 2026 11:37
Signed-off-by: aagallo <aagallo@amzon.com>

Collaborator

@KeitaW KeitaW left a comment

Review 1/2 — Existing CPU Support & Approach

I appreciate the effort here, but I need to understand the value add of this PR before proceeding. We have a workshop on April 9th that depends on the CPU DDP test case working reliably, so I want to be careful about merging changes that add maintenance surface area without a clear need. The existing ddp.py training script already supports both CPU and GPU — it auto-detects the available hardware and selects the appropriate backend and device. Before adding duplicate scripts, I'd like to see the cases that the current implementation actually fails on CPU instances.

The existing code already supports CPU

File: 3.test_cases/pytorch/ddp/ddp.py (lines 37-42, 63)

The training script already handles CPU transparently:

def ddp_setup():
    if torch.cuda.is_available():
        init_process_group(backend="nccl")
    else:
        init_process_group(backend="gloo")
self.device = torch.device(f"cuda:{os.environ['LOCAL_RANK']}" if torch.cuda.is_available() else "cpu")

And DDP is initialized with device_ids=None when CUDA is unavailable:

self.model = DDP(self.model, device_ids=[self.device.index] if torch.cuda.is_available() else None)

This means ddp.py will run on CPU instances out of the box with torchrun. The only thing that changes for CPU is the PyTorch installation step — and that doesn't require three new files. The existing 0.create-venv.sh installs torch==2.10.0 which pip will resolve to a CPU-compatible wheel on a machine without CUDA.

What I'd like to see: Could you test the existing scripts on your PCS CPU instance setup first and share what specifically fails? If there's a real gap (e.g., pip pulls CUDA dependencies that bloat the venv or the Docker build fails), let's fix it by parameterizing the existing scripts rather than duplicating them.


Bugs worth fixing in the existing code

While reviewing this PR, I noticed a couple of issues in the existing files that are worth fixing separately — and you clearly ran into these:

  1. Dockerfile uses pytorch/pytorch:latest — this violates the repo convention of pinned version tags. You correctly pinned torch==2.10.0 in the CPU Dockerfile. It would be great if you could submit a smaller PR that pins the base image version in the existing GPU Dockerfile (e.g., pytorch/pytorch:2.10.0-cuda12.8-cudnn9-runtime) and adds --no-cache-dir to the pip install.

  2. MNIST download race condition is a real issue — torchvision's datasets.MNIST(download=True) has no file locking or atomic writes. When multiple torchrun ranks call it simultaneously, they all pass the _check_exists() check and write to the same files concurrently, causing corrupted downloads or extraction failures. This affects both CPU and GPU. The standard fix is the rank-0-with-barrier pattern in ddp.py:

    rank = int(os.environ.get("RANK", 0))
    if rank == 0:
        datasets.MNIST(root='./data', train=True, download=True)
    torch.distributed.barrier()
    # All ranks now safely load the already-downloaded data
    train_set = datasets.MNIST(root='./data', train=True, download=False, transform=transform)

    This is a small, targeted change to load_train_objs() and would be a welcome standalone PR.

  3. 1.venv-train.sbatch is missing the copyright header — the GPU venv sbatch script doesn't have the license header that 3.container-train.sbatch does.


Things That Look Great

  • The PR description is excellent — thorough test plan, clear directory structure, and good explanation of GPU vs CPU differences.
  • Version pins on PyTorch (2.10.0+cpu), torchvision (0.25.0+cpu), and mlflow (2.13.2) are specific and reproducible.
  • The awareness of MNIST download race conditions in distributed settings shows real hands-on experience.
  • Using python:3.12-slim as the base image is a smart choice for CPU-only workloads.

Collaborator

@KeitaW KeitaW left a comment

Review 2/2 — Additional Concerns if Revisited

These would need to be addressed if the PR is revisited after testing the existing scripts on CPU.

Comment on lines +1 to +3
FROM python:3.12-slim

# --- ADD THIS ---
Collaborator

Missing copyright header

All files in this repo require the license header. Also, the # --- ADD THIS --- comment reads like a tutorial instruction rather than a production comment — I'd suggest replacing it with a descriptive comment.

Suggested change
- FROM python:3.12-slim
- # --- ADD THIS ---
+ # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
+ # SPDX-License-Identifier: MIT-0
+ FROM python:3.12-slim
+ # torchvision requires these system libraries for image decoding


WORKDIR /workspace

# --- CHANGE THESE ---
Collaborator

Tutorial-style comment

This comment doesn't add value in the committed file — consider replacing with something descriptive.

Suggested change
- # --- CHANGE THESE ---
+ # Copy training script and pre-downloaded dataset

Comment on lines +1 to +2
#!/bin/bash
set -ex
Collaborator

Missing copyright header & shebang inconsistency

This script is missing the required MIT-0 license header. Also, the existing GPU scripts use #!/usr/bin/env bash — I'd suggest matching that convention.

Suggested change
- #!/bin/bash
- set -ex
+ #!/usr/bin/env bash
+ set -ex
+ # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
+ # SPDX-License-Identifier: MIT-0


# 1. Pre-download MNIST to a local folder named 'data'
# This ensures we have the files ready to COPY into the Docker image
python3 -c "from torchvision import datasets; datasets.MNIST(root='./data', train=True, download=True)"
Collaborator

Host-side torchvision dependency

This line requires torchvision to be installed on the build host (not inside the container). On a fresh Slurm head node, torchvision is unlikely to be available. The existing GPU version (2.create-enroot-image.sh) doesn't need this. I'd suggest either downloading data inside the Dockerfile (a RUN step), or using curl/wget to fetch the raw MNIST files without requiring torchvision on the host.
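
As a sketch of the Dockerfile route (the root path here is illustrative): torchvision is already installed inside the image by this point in the build, so the build host needs nothing.

```dockerfile
# Download MNIST at build time, inside the container where torchvision exists;
# the build host then needs neither torchvision nor any Python packages
RUN python3 -c "from torchvision import datasets; \
    datasets.MNIST(root='/workspace/data', train=True, download=True)"
```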

Collaborator

@KeitaW KeitaW left a comment

Left a few comments.

Author

aagallo commented Mar 27, 2026

Thanks @KeitaW for the thorough feedback. Let me address the points and propose a path forward:

Core Issue: CPU-Specific PyTorch Installation

You're absolutely right that ddp.py supports both CPU and GPU out of the box. The issue isn't with the training script—it's with the PyTorch installation.

When I tested the existing scripts on c6i instances (PCS), the installation succeeded but installed GPU versions of PyTorch and torchvision by default. These GPU versions fail at runtime on instances without GPU hardware.

The +cpu suffix is critical: torch==2.10.0+cpu explicitly tells pip to use the CPU-only wheel from the PyTorch CPU repository, ensuring the libraries work correctly on CPU-only instances.

Proposed Approach

I agree that duplicating scripts adds maintenance overhead. However, I'd like to propose keeping the CPU-specific variants for the following reasons:

  • Backward compatibility: There are existing workshops and assets already created using the current structure. Parameterizing the existing scripts would break compatibility with these materials.
  • Clarity: Separate CPU scripts make it immediately clear which version to use for CPU vs GPU environments.

That said, I'm open to parameterization if you prefer—we could add a --cpu flag to the existing scripts. Let me know your preference.
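
For what it's worth, the parameterized branch could be as small as a helper that switches the pip index. This is a sketch, not the repo's actual script: torch_pip_args and the flag handling are illustrative names, while the index URL is PyTorch's official CPU wheel repository.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Emit the extra pip arguments for the requested target ("cpu" or "gpu").
torch_pip_args() {
    if [ "${1:-gpu}" = "cpu" ]; then
        # CPU-only wheels live on a dedicated index
        echo "--index-url https://download.pytorch.org/whl/cpu"
    fi
}

# The install line in 0.create-venv.sh would then become:
#   pip install --no-cache-dir $(torch_pip_args "$MODE") torch==2.10.0 torchvision==0.25.0
torch_pip_args cpu   # prints: --index-url https://download.pytorch.org/whl/cpu
torch_pip_args gpu   # prints nothing
```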

Bugs to Fix

For the issues you identified:

  • MNIST download race condition in ddp.py using the rank-0-with-barrier pattern you suggested
  • Pin base image version in Dockerfile (e.g., pytorch/pytorch:2.10.0-cuda12.8-cudnn9-runtime) and add --no-cache-dir

Are these strictly necessary to address in this PR, or would you prefer I handle them separately?

Fixes for Current PR (if we proceed)

If we move forward with the CPU-specific files, I'll address:
  • Add copyright headers to all new files (Dockerfile.cpu, 2.create-enroot-image-cpu.sh)
  • Replace tutorial-style comments with descriptive ones
  • Use #!/usr/bin/env bash for consistency
  • Address the host-side torchvision dependency by downloading MNIST inside the Dockerfile (RUN step) instead of requiring torchvision on the build host

Let me know how you'd like to proceed—I'm happy to either refactor this PR with the fixes above or pivot to a parameterized approach if that's preferred.

Collaborator

KeitaW commented Mar 28, 2026

Test Results: Existing DDP Code on CPU Instances

I set up a HyperPod Slurm cluster with CPU-only instances (ml.c5.2xlarge x2 compute nodes) in us-east-1 and ran the existing, unmodified test case scripts. Here are the results:

Environment

  • Cluster: HyperPod Slurm (cpu-ddp-test)
  • Compute nodes: 2x ml.c5.2xlarge (CPU-only, no GPU)
  • PyTorch: 2.10.0 (GPU wheel, installed via existing 0.create-venv.sh)
  • Shared storage: FSx Lustre

Test 1: 0.create-venv.sh (existing GPU venv script)

Result: PASS — pip install torch==2.10.0 torchvision==0.25.0 completed successfully on CPU instances. It pulls ~2GB of unnecessary NVIDIA dependencies (cudnn, nccl, etc.), but they install without error.

Test 2: sbatch 1.venv-train.sbatch (first run, no pre-downloaded data)

Result: FAIL — but not because of CPU/GPU. The failure was the MNIST download race condition:

[rank1]: RuntimeError: File not found or corrupted.

Both ranks called datasets.MNIST(download=True) simultaneously on the shared FSx filesystem. Rank 1 tried to verify an MD5 checksum while rank 0 was still writing the file, causing corruption. This is a pre-existing bug that affects both CPU and GPU — it has nothing to do with the PyTorch wheel variant.

Test 3: sbatch 1.venv-train.sbatch (after pre-downloading MNIST on head node)

Result: PASS — training completed successfully across 2 CPU nodes:

Using GLOO backend for CPU training
Using GLOO backend for CPU training
[Gloo] Rank 0 is connected to 1 peer ranks.
[Gloo] Rank 1 is connected to 1 peer ranks.
[RANK 0] Epoch 0 | Batchsize: 32 | Steps: 938 | Loss: 0.198
[RANK 1] Epoch 0 | Batchsize: 32 | Steps: 938 | Loss: 0.199
...
[RANK 0] Epoch 9 | Batchsize: 32 | Steps: 938 | Loss: 0.023
[RANK 1] Epoch 9 | Batchsize: 32 | Steps: 938 | Loss: 0.024
worker group successfully finished. Waiting 300 seconds for other agents to finish

The existing ddp.py auto-detected CPU, selected the GLOO backend, and trained to convergence — all with the standard GPU PyTorch wheel.

Conclusion

The +cpu wheel suffix saves disk space (~2GB less NVIDIA deps) but is not required for correctness. The GPU wheels install and run fine on CPU-only instances. The only real issue is the MNIST download race condition, which is a pre-existing bug in ddp.py that affects both CPU and GPU deployments on shared filesystems.

Suggested follow-up PRs

I'd love to see you contribute these as separate, focused PRs:

  1. Fix the MNIST download race condition in ddp.py — add the rank-0-with-barrier pattern to load_train_objs():

    rank = int(os.environ.get("RANK", 0))
    if rank == 0:
        datasets.MNIST(root='./data', train=True, download=True)
    torch.distributed.barrier()
    train_set = datasets.MNIST(root='./data', train=True, download=False, transform=transform)
  2. Pin the base image in the existing Dockerfile — replace pytorch/pytorch:latest with a pinned version tag.

These would be valuable, targeted improvements. Thanks for the effort on this PR — your investigation of the race condition was spot-on.
