This repository provides CloudFormation templates and examples for running distributed training jobs on AWS Batch using GPU instances. The architecture can be easily modified to accommodate different instance types including Trainium (Trn) and other P-series instances.
- Prerequisites
- Architecture Overview
- Available Templates
- P4 Instance Deployment
- P5 Instance Deployment
- P6 Instance Deployment
- Important Considerations
⚠️ Important: You must deploy the VPC template `2.vpc-one-az.yaml` before deploying any Batch template. The Batch templates automatically fetch the EFA Security Group ID and Subnet ID from the VPC template's exported values.
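
Before launching a Batch template, it can help to confirm that the VPC stack has finished deploying and to see which values it exports. The following is a minimal sketch, assuming the AWS CLI is configured for the target Region. `vpc_stack_ready` is a hypothetical helper (not part of this repository) and `your-vpc-stack-name` is a placeholder.

```bash
# Hypothetical helper: succeeds only if the named stack is fully deployed.
# Usage: vpc_stack_ready <stack-name>
vpc_stack_ready() {
  local stack_name="$1" status
  status=$(aws cloudformation describe-stacks \
    --stack-name "${stack_name}" \
    --query 'Stacks[0].StackStatus' \
    --output text) || return 1
  case "${status}" in
    CREATE_COMPLETE|UPDATE_COMPLETE) return 0 ;;
    *) echo "Stack ${stack_name} is in state ${status}" >&2; return 1 ;;
  esac
}

# Example (placeholder stack name): list the exported names once the stack is ready.
# vpc_stack_ready "your-vpc-stack-name" && \
#   aws cloudformation list-exports --query 'Exports[].Name' --output table
```

Note that `list-exports` returns every export in the account and Region, not only those of the VPC stack, so the output may include unrelated names.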
This architecture consists of the following AWS resources:
| Component | Purpose | Documentation |
|---|---|---|
| AWS Batch Compute Environment | Manages compute resources for multi-node parallel jobs (similar to a compute cluster) | AWS Docs |
| AWS Batch Job Queue | Queues jobs for execution (similar to Slurm/LSF schedulers) | AWS Docs |
| EC2 Launch Template | Configures EFA network interfaces for high-performance networking | AWS Docs |
| Job Definition | Template for job execution, references container images | AWS Docs |
| ECR Container Registry | Stores Docker container images | AWS Docs |

| Template | Instance Types | Features |
|---|---|---|
| `0.aws-batch-distributed-training.yaml` | P4d.24xlarge (default) | Standard deployment with 4 EFA interfaces |
| `0.aws-batch-distributed-training-p5.yaml` | P5.48xlarge | Optimized for P5 instances |
| `aws-batch-distributed-training-p6.yaml` | P6-b200.48xlarge | P6 deployment with sample AWS Batch MNP job setup |

Deploy the standard template, `0.aws-batch-distributed-training.yaml`, with the following parameters:
| Parameter | Type | Description |
|---|---|---|
| `VPCStackParameter` | Required | Name of the VPC CloudFormation stack |
| `AMI` | Optional | Custom AMI ID (leave blank for default) |
| `CapacityReservationId` | Optional | EC2 Capacity Reservation ID |
| `CapacityReservationResourceGroup` | Optional | Alternative to `CapacityReservationId` |
| `EC2KeyPair` | Optional | EC2 key pair for SSH debugging |
| `CreatePlacementGroup` | Optional | Create placement group for instances |

```bash
aws cloudformation create-stack \
  --stack-name aws-batch-distributed-training \
  --template-body file://0.aws-batch-distributed-training.yaml \
  --parameters \
    ParameterKey=VPCStackParameter,ParameterValue="your-vpc-stack-name" \
    ParameterKey=CapacityReservationId,ParameterValue="cr-1234567890" \
  --capabilities CAPABILITY_NAMED_IAM
```

| Parameter | Type | Description |
|---|---|---|
| `VPCStackParameter` | Required | Name of the VPC CloudFormation stack |
| `CapacityReservationId` | Required | Capacity Reservation ID (e.g., `cr-1234567890`) |

```bash
aws cloudformation create-stack \
  --stack-name batch-p6 \
  --template-body file://aws-batch-distributed-training-p6.yaml \
  --parameters \
    ParameterKey=VPCStackParameter,ParameterValue="your-vpc-stack-name" \
    ParameterKey=CapacityReservationId,ParameterValue="cr-1234567890" \
  --capabilities CAPABILITY_NAMED_IAM
```

Generate an SSH key pair and store the private key in Secrets Manager:

```bash
# Generate SSH key pair
ssh-keygen -t rsa -b 2048 -N '' -f /tmp/batch_key

# Store private key in Secrets Manager
aws secretsmanager put-secret-value \
  --secret-id batch-p6-ssh-key \
  --secret-string file:///tmp/batch_key

# Clean up temporary files
rm /tmp/batch_key /tmp/batch_key.pub
```

Submit a multi-node NCCL test job to verify the setup:

```bash
# Retrieve stack outputs
JOB_DEFINITION=$(aws cloudformation describe-stacks \
  --stack-name batch-p6 \
  --query 'Stacks[0].Outputs[?OutputKey==`JobDefinitionMultiInstance`].OutputValue' \
  --output text)
JOB_QUEUE=$(aws cloudformation describe-stacks \
  --stack-name batch-p6 \
  --query 'Stacks[0].Outputs[?OutputKey==`DistributedDeepLearningJQ`].OutputValue' \
  --output text)

# Submit test job and capture its ID for the monitoring step
JOB_ID=$(aws batch submit-job \
  --job-name nccl-test-2node \
  --job-queue "${JOB_QUEUE}" \
  --job-definition "${JOB_DEFINITION}" \
  --node-overrides numNodes=2 \
  --query 'jobId' \
  --output text)

# Monitor job status
aws batch describe-jobs --jobs "${JOB_ID}"

# View logs
aws logs tail /aws/batch/job --follow
```

- Container Image: `public.ecr.aws/hpc-cloud/nccl-tests:latest`
- Network Configuration: 8 EFA interfaces per instance
- SSH Setup: Automated via inline bash script in Job Definition
- Default Test: `all_reduce_perf` with 8 GPUs per node (16 total processes for 2-node job)
- Key Management: SSH keys retrieved from Secrets Manager at container startup
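
To wait for the NCCL test job to finish rather than re-running `describe-jobs` by hand, a polling loop along the following lines may help. This is a sketch under stated assumptions: `wait_for_batch_job` is a hypothetical helper (not part of this repository), the 30-second default interval is arbitrary, and the job ID is whatever `aws batch submit-job` returned.

```bash
# Hypothetical helper: poll a Batch job until it reaches a terminal state.
# Usage: wait_for_batch_job <job-id> [poll-interval-seconds]
# Returns 0 on SUCCEEDED, 1 on FAILED, 2 if the job cannot be looked up.
wait_for_batch_job() {
  local job_id="$1" interval="${2:-30}" status
  while true; do
    status=$(aws batch describe-jobs \
      --jobs "${job_id}" \
      --query 'jobs[0].status' \
      --output text) || return 2
    echo "Job ${job_id}: ${status}"
    case "${status}" in
      SUCCEEDED) return 0 ;;
      FAILED)    return 1 ;;
      # The CLI prints "None" in text mode when no matching job is returned.
      None)      echo "Job ${job_id} not found" >&2; return 2 ;;
    esac
    sleep "${interval}"
  done
}

# Example (placeholder job ID):
# wait_for_batch_job "<job-id>" 60 && echo "NCCL test passed"
```

The helper only reports the top-level job status. For a multi-node parallel job, per-node details would still need to be inspected separately, for example in the CloudWatch logs.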
- EFA interfaces must be explicitly declared in the EC2 Launch Template
- The EFA security group must be provided and properly configured
- Network performance is critical for distributed training workloads
- The Compute Environment retrieves private subnet information from the VPC template
- Ensure the VPC template exports the required subnet and security group values
- Both templates must be deployed in the same AWS region
- Use Capacity Reservations for guaranteed instance availability
- Consider using Capacity Reservation Resource Groups for easier management
- Monitor your EC2 service quotas (limits) and request increases if needed
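
As a rough pre-flight check for the Capacity Reservation points above, the sketch below asks EC2 whether a reservation is active and has enough unused instances for a job. `cr_has_capacity` is a hypothetical helper (not part of this repository) and `cr-1234567890` is the placeholder ID used elsewhere in this document.

```bash
# Hypothetical helper: succeed if the reservation is active and has at least
# <needed-instances> unused instances right now.
# Usage: cr_has_capacity <capacity-reservation-id> <needed-instances>
cr_has_capacity() {
  local cr_id="$1" needed="$2" result
  result=$(aws ec2 describe-capacity-reservations \
    --capacity-reservation-ids "${cr_id}" \
    --query 'CapacityReservations[0].[State,AvailableInstanceCount]' \
    --output text) || return 2
  # Text output is "<state><TAB><available>"; split it on whitespace.
  set -- ${result}
  local state="$1" available="$2"
  if [ "${state}" = "active" ] && [ "${available}" -ge "${needed}" ]; then
    echo "${cr_id}: ${available} instance(s) available"
    return 0
  fi
  echo "${cr_id}: state=${state}, available=${available}, needed=${needed}" >&2
  return 1
}

# Example (placeholder ID): check for room before submitting a 2-node job.
# cr_has_capacity "cr-1234567890" 2
```

The available count is a point-in-time value, so a passing check does not guarantee the instances are still free when the job is scheduled.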
