
docs: comprehensive instance hardware profiles (16 families) #1021

Draft

KeitaW wants to merge 4 commits into docs/instance-compatibility from docs/improved-instance-profiles

Conversation

KeitaW (Collaborator) commented Mar 13, 2026

Summary

Builds on #1017 to make the instance profiles comprehensive and authoritative. This PR:

  • Removes the central compatibility matrix (docs/instance-compatibility.md) and all "Tested Configurations" / "Instance Compatibility" sections from 22 test case READMEs — these would not be maintained as test cases evolve
  • Rewrites all 6 existing instance profiles with complete hardware specs
  • Adds 10 new profiles covering every accelerated instance family on AWS

Instance Profiles (16 total)

NVIDIA G-Family (PCIe, no NVSwitch):

  • g4dn.md — T4 (Turing, 16 GB GDDR6) NEW
  • g5.md — A10G (Ampere, 24 GB GDDR6) rewritten
  • g6.md — L4 (Ada Lovelace, 24 GB GDDR6) NEW
  • g6e.md — L40S (Ada Lovelace, 48 GB GDDR6) rewritten
  • g7e.md — RTX PRO 6000 (Blackwell, 96 GB GDDR7) NEW

NVIDIA P-Family (NVSwitch, GPUDirect RDMA):

  • p4d.md — A100 40 GB (Ampere) NEW
  • p4de.md — A100 80 GB (Ampere) rewritten
  • p5.md — H100 (Hopper, 80 GB HBM3) rewritten
  • p5e.md — H200 (Hopper, 141 GB HBM3e) NEW
  • p5en.md — H200 (Hopper, 141 GB HBM3e, EFAv3) rewritten
  • p6-b200.md — B200 (Blackwell, 192 GB HBM3e) NEW
  • p6-b300.md — B300 (Blackwell Ultra, 288 GB HBM3e) NEW
  • p6e-gb200.md — GB200 Grace Blackwell (72-GPU NVLink domain) NEW

AWS Custom Silicon:

  • trn1.md — Trainium v1 (trn1/trn1n) rewritten
  • trn2.md — Trainium v2 (trn2/trn2u UltraServer) NEW
  • inf2.md — Inferentia v2 NEW

What each profile now includes

  • GPU compute TFLOPS (BF16, FP8, TF32) and memory bandwidth
  • NVLink/NeuronLink generation and per-GPU bandwidth
  • EFA generation (v1→v4) and adapter count
  • GPUDirect RDMA support
  • All available instance sizes in the family
  • Distributed training considerations and model sizing guidance
  • Cross-references to the EFA cheatsheet (no NCCL settings duplication)

What was removed and why

  • docs/instance-compatibility.md — central compatibility matrix that would go stale
  • 22 README sections — "Tested Configurations" and "Instance Compatibility" links that reference the removed matrix
  • The veRL OOM debugging knowledge from the removed sections is preserved in the g5 profile's training considerations

Test plan

KeitaW added 2 commits March 13, 2026 00:42
Remove docs/instance-compatibility.md (central compatibility matrix) and
all "Tested Configurations" / "Instance Compatibility" sections from 22
test case READMEs. These tables would not be maintained as test cases
evolve, and stale compatibility data is worse than no data.

The veRL OOM debugging knowledge is preserved in the instance profiles.

Rewrite all 6 existing profiles and add 10 new ones covering every
accelerated instance family relevant to distributed training on AWS:

New profiles: g4dn (T4), g6 (L4), g7e (RTX PRO 6000), p4d (A100 40GB),
p5e (H200), p6-b200 (B200), p6-b300 (B300), p6e-gb200 (GB200 Grace),
trn2/trn2u (Trainium v2), inf2 (Inferentia v2)

Each profile now includes:
- GPU compute TFLOPS (BF16, FP8, TF32) and memory bandwidth
- NVLink/NeuronLink generation and per-GPU bandwidth
- EFA generation (v1-v4) and adapter count
- GPUDirect RDMA support
- All available instance sizes
- Distributed training considerations and model sizing guidance
- Cross-reference to EFA cheatsheet instead of duplicating NCCL settings

Updated README index with summary comparison tables organized by
GPU family, use case, and RDMA support.
KeitaW marked this pull request as draft March 13, 2026 00:45
KeitaW added 2 commits March 13, 2026 02:22
Corrections across all 15 instance profile files:
- NVIDIA TFLOPS: use dense (not sparse) as primary values throughout
- p6-b200: VRAM 192→179 GB per AWS docs
- p6-b300: BF16 2,500→2,250, FP8 5,000→4,500, FP4 10,000→13,500 (dense)
- p6e-gb200: BF16 2,250→2,500 (GB200 variant higher than standard B200)
- inf2: EFA support corrected to None (Inf2 does not support EFA)
- trn1: memory bandwidth 613→820 GB/s, NVMe sizes corrected
- trn2: separated trn2u instance vs UltraServer in table, BF16 632→667
- g6: BF16 242→121, g7e: memory BW 1,792→1,597 GB/s
- All profiles: "bisection bandwidth" → "aggregate bandwidth" for NVLink
- README index updated with all corrected values

Add three high-level summary tables to the instance profiles README:
- GPU/Accelerator Quick Reference with BF16, FP8, FP4, TF32/FP32 TFLOPS
- NVIDIA Instance Quick Reference with GPU memory, NVLink, and EFA specs
- Trainium/Inferentia Instance Quick Reference with chip and interconnect details

Also update p5.md to include p5.4xlarge in the Covers line.
paragao (Contributor) left a comment


Review

Great work on this PR. The 16 instance profiles are well-structured, consistent in format, and provide actionable guidance — especially the NCCL/EFA configuration sections and distributed training considerations. Consolidating from scattered per-README tables into centralized profiles is the right call for long-term maintainability.

A few suggestions to tighten up the data consistency before merging:

1. Trainium v1 Memory Bandwidth — inconsistent across files

  • docs/instance-profiles/trn1.md (Hardware at a Glance) and docs/instance-profiles/README.md (Quick Reference) both list 820 GB/s per chip.
  • docs/instance-profiles/trn2.md (comparison table) lists 613 GB/s for trn1.

Could you cross-check against the official AWS documentation and update all three files to use the same verified value? The 4.7x multiplier in the trn2 comparison table is derived from 613, so whichever value is correct, the multiplier will need to be updated accordingly.
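
For quick reference, the arithmetic (a sketch in Python; the trn2 per-chip bandwidth of ~2,880 GB/s is backed out of the table's own 613 × 4.7, not an independently verified spec):

```python
# Recompute the trn1 -> trn2 memory-bandwidth multiplier for both candidate
# trn1 values. The trn2 figure is inferred from the existing comparison
# table (613 GB/s x 4.7), so treat it as an assumption until verified.
trn2_bw = 613 * 4.7  # ~2,881 GB/s implied by the current table

for trn1_bw in (613, 820):
    print(f"trn1 at {trn1_bw} GB/s -> trn2 multiplier ~ {trn2_bw / trn1_bw:.1f}x")

# trn1 at 613 GB/s -> trn2 multiplier ~ 4.7x  (current comparison table)
# trn1 at 820 GB/s -> trn2 multiplier ~ 3.5x  (value in trn1.md and the README)
```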

2. Trainium v2 BF16 TFLOPS — inconsistent within trn2.md

  • docs/instance-profiles/trn2.md Hardware at a Glance and docs/instance-profiles/README.md Quick Reference both say ~667 TFLOPS per chip.
  • The comparison table in the same trn2.md file says ~632 TFLOPS with a 3.3x multiplier.
  • The Key Characteristics section claims "3.5x the BF16 compute" which aligns with 667 but not 632.

Worth verifying the correct figure against AWS documentation and making the Hardware at a Glance table, comparison table, and Key Characteristics text all consistent.
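
Same sanity check here (the trn1 BF16 base of ~191 TFLOPS is inferred from the table's own 632 / 3.3, not a verified spec):

```python
# Back out the trn1 BF16 base implied by the comparison table, then see
# which trn2 figure matches the "3.5x" Key Characteristics claim.
trn1_bf16 = 632 / 3.3  # ~191.5 TFLOPS implied by the table (assumption)

for trn2_bf16 in (632, 667):
    print(f"trn2 at {trn2_bf16} TFLOPS -> {trn2_bf16 / trn1_bf16:.1f}x trn1")

# trn2 at 632 TFLOPS -> 3.3x trn1  (comparison table)
# trn2 at 667 TFLOPS -> 3.5x trn1  (matches Hardware at a Glance and the 3.5x claim)
```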

3. p4de comparison table — apples-to-oranges TFLOPS comparison

In docs/instance-profiles/p4de.md, the comparison with p5en shows:

| BF16 TFLOPS/GPU | 312 | 1,979 |

The A100 value (312) is dense BF16, while the H200 value (1,979) is sparse BF16; the H200 dense BF16 figure is 990 TFLOPS. Suggest comparing like for like — either both dense (312 vs 990) or labeling the values explicitly as "(dense)" / "(sparse)" — so readers aren't misled by the apparent ~6x gap.
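
To make the distortion concrete (values are from the row above; NVIDIA's 2:4 structured-sparsity figures are 2× the dense ones):

```python
# Ratio a reader infers from the current mixed row vs. the like-for-like ratio.
a100_bf16_dense = 312    # p4de table, dense
h200_bf16_sparse = 1979  # p4de table, sparse (2x dense)
h200_bf16_dense = 990    # ~ h200_bf16_sparse / 2

print(f"mixed dense-vs-sparse: {h200_bf16_sparse / a100_bf16_dense:.1f}x")  # ~6.3x, misleading
print(f"dense-vs-dense:        {h200_bf16_dense / a100_bf16_dense:.1f}x")   # ~3.2x, actual gap
```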

4. veRL README — add a cross-reference to the g5 profile

The veRL RLVR READMEs (3.test_cases/pytorch/verl/kubernetes/rlvr/README.md and 3.test_cases/pytorch/verl/hyperpod-eks/rlvr/README.md) previously had a detailed g5-specific OOM parameter table. The general lesson is preserved in g5.md, but the workload-specific parameter guidance is lost. A one-line note such as:

For g5 instance memory optimization guidance, see the g5 instance profile.

would help users running veRL on g5 find the relevant information.


Overall this is a strong documentation improvement. The suggestions above are non-blocking but would improve accuracy and completeness. Nice work.

