docs: comprehensive instance hardware profiles (16 families) #1021
KeitaW wants to merge 4 commits into docs/instance-compatibility from
Conversation
Remove docs/instance-compatibility.md (central compatibility matrix) and all "Tested Configurations" / "Instance Compatibility" sections from 22 test case READMEs. These tables would not be maintained as test cases evolve, and stale compatibility data is worse than no data. The veRL OOM debugging knowledge is preserved in the instance profiles.
Rewrite all 6 existing profiles and add 10 new ones covering every accelerated instance family relevant to distributed training on AWS.

New profiles: g4dn (T4), g6 (L4), g7e (RTX PRO 6000), p4d (A100 40GB), p5e (H200), p6-b200 (B200), p6-b300 (B300), p6e-gb200 (GB200 Grace), trn2/trn2u (Trainium v2), inf2 (Inferentia v2).

Each profile now includes:
- GPU compute TFLOPS (BF16, FP8, TF32) and memory bandwidth
- NVLink/NeuronLink generation and per-GPU bandwidth
- EFA generation (v1-v4) and adapter count
- GPUDirect RDMA support
- All available instance sizes
- Distributed training considerations and model sizing guidance
- Cross-reference to the EFA cheatsheet instead of duplicating NCCL settings

Updated the README index with summary comparison tables organized by GPU family, use case, and RDMA support.
Corrections across all 15 instance profile files:
- NVIDIA TFLOPS: use dense (not sparse) as primary values throughout
- p6-b200: VRAM 192→179 GB per AWS docs
- p6-b300: BF16 2,500→2,250, FP8 5,000→4,500, FP4 10,000→13,500 (dense)
- p6e-gb200: BF16 2,250→2,500 (GB200 variant higher than standard B200)
- inf2: EFA support corrected to None (Inf2 does not support EFA)
- trn1: memory bandwidth 613→820 GB/s, NVMe sizes corrected
- trn2: separated trn2u instance vs UltraServer in table, BF16 632→667
- g6: BF16 242→121; g7e: memory BW 1,792→1,597 GB/s
- All profiles: "bisection bandwidth" → "aggregate bandwidth" for NVLink
- README index updated with all corrected values
Add three high-level summary tables to the instance profiles README:
- GPU/Accelerator Quick Reference with BF16, FP8, FP4, and TF32/FP32 TFLOPS
- NVIDIA Instance Quick Reference with GPU memory, NVLink, and EFA specs
- Trainium/Inferentia Instance Quick Reference with chip and interconnect details

Also update p5.md to include p5.4xlarge in the Covers line.
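Since the profiles now cross-reference the EFA cheatsheet rather than duplicating NCCL settings, a minimal sketch of what such a launcher-side configuration typically looks like may help readers orient themselves. The exact values belong in the cheatsheet; the entries below are illustrative defaults, not the profiles' verbatim recommendations.

```python
import os

# Sketch: typical Libfabric/NCCL environment for EFA-enabled instances.
# These variable names are real, but whether each applies depends on the
# instance family (e.g. GPUDirect RDMA is P-family only).
efa_env = {
    "FI_PROVIDER": "efa",            # select the Libfabric EFA provider
    "FI_EFA_USE_DEVICE_RDMA": "1",   # enable GPUDirect RDMA where supported
    "NCCL_DEBUG": "INFO",            # surface NCCL transport selection in logs
}
os.environ.update(efa_env)
```

A job launcher (torchrun, srun, etc.) would export these before initializing the process group.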
paragao
left a comment
Review
Great work on this PR. The 16 instance profiles are well-structured, consistent in format, and provide actionable guidance — especially the NCCL/EFA configuration sections and distributed training considerations. Consolidating from scattered per-README tables into centralized profiles is the right call for long-term maintainability.
A few suggestions to tighten up the data consistency before merging:
1. Trainium v1 Memory Bandwidth — inconsistent across files
- docs/instance-profiles/trn1.md (Hardware at a Glance) and docs/instance-profiles/README.md (Quick Reference) both list 820 GB/s per chip.
- docs/instance-profiles/trn2.md (comparison table) lists 613 GB/s for trn1.
Could you cross-check against the official AWS documentation and update all three files to use the same verified value? The 4.7x multiplier in the trn2 comparison table is derived from 613, so whichever value is correct, the multiplier will need to be updated accordingly.
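As a quick sanity check on how the multiplier shifts, assuming (hypothetically) that the trn2 comparison table derives its Trainium2 bandwidth by pairing 613 GB/s with the stated 4.7x:

```python
# Implied trn2 bandwidth if the table's 4.7x was computed from 613 GB/s.
trn2_bw = 613 * 4.7            # ≈ 2881 GB/s implied by the table
# Multiplier that same bandwidth yields if 820 GB/s is the verified trn1 value.
mult_if_820 = trn2_bw / 820
print(round(mult_if_820, 1))   # ≈ 3.5, so the 4.7x would need updating
```

In other words, if 820 GB/s is correct, the comparison table's multiplier drops from 4.7x to roughly 3.5x.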
2. Trainium v2 BF16 TFLOPS — inconsistent within trn2.md
- docs/instance-profiles/trn2.md Hardware at a Glance and docs/instance-profiles/README.md Quick Reference both say ~667 TFLOPS per chip.
- The comparison table in the same trn2.md file says ~632 TFLOPS with a 3.3x multiplier.
- The Key Characteristics section claims "3.5x the BF16 compute", which aligns with 667 but not 632.
Worth verifying the correct figure against AWS documentation and making the Hardware at a Glance table, comparison table, and Key Characteristics text all consistent.
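The multipliers themselves hint at which figure each section assumed. Taking ~190 BF16 TFLOPS per Trainium1 chip as the baseline (AWS's published figure, assumed here rather than taken from the PR):

```python
# Check which trn2 BF16 figure each stated multiplier is consistent with,
# assuming ~190 BF16 TFLOPS per Trainium1 chip.
trn1_bf16 = 190
print(round(632 / trn1_bf16, 1))  # 3.3 -> matches the comparison table's 3.3x
print(round(667 / trn1_bf16, 1))  # 3.5 -> matches "3.5x the BF16 compute"
```

So the 3.3x/632 pair and the 3.5x/667 pair are each internally consistent; the files just disagree on which pair is correct.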
3. p4de comparison table — apples-to-oranges TFLOPS comparison
In docs/instance-profiles/p4de.md, the comparison with p5en shows:
| BF16 TFLOPS/GPU | 312 | 1,979 |
The A100 value (312) is dense BF16, while the H200 value (1,979) is sparse BF16. The H200 dense BF16 is 990 TFLOPS. Suggest using consistent precision — either both dense (312 vs 990) or labeling them explicitly as "(dense)" / "(sparse)" so readers aren't misled by the apparent ~6x gap.
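The size of the distortion is easy to show with the numbers quoted above:

```python
# Dense-vs-sparse mixing nearly doubles the apparent A100 -> H200 gap.
a100_dense, h200_dense, h200_sparse = 312, 990, 1979
print(round(h200_sparse / a100_dense, 1))  # ~6.3x (dense vs sparse, misleading)
print(round(h200_dense / a100_dense, 1))   # ~3.2x (dense vs dense, comparable)
```

Either consistent pairing works; the current table implies roughly twice the real dense-to-dense gap.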
4. veRL README — add a cross-reference to the g5 profile
The veRL RLVR READMEs (3.test_cases/pytorch/verl/kubernetes/rlvr/README.md and 3.test_cases/pytorch/verl/hyperpod-eks/rlvr/README.md) previously had a detailed g5-specific OOM parameter table. The general lesson is preserved in g5.md, but the workload-specific parameter guidance is lost. A one-line note such as:
For g5 instance memory optimization guidance, see the g5 instance profile.
would help users running veRL on g5 find the relevant information.
Overall this is a strong documentation improvement. The suggestions above are non-blocking but would improve accuracy and completeness. Nice work.
Summary
Builds on #1017 to make the instance profiles comprehensive and authoritative. This PR:
Removes docs/instance-compatibility.md (the central compatibility matrix) and all "Tested Configurations" / "Instance Compatibility" sections from 22 test case READMEs — these would not be maintained as test cases evolve.

Instance Profiles (16 total)
NVIDIA G-Family (PCIe, no NVSwitch):
- g4dn.md — T4 (Turing, 16 GB GDDR6) NEW
- g5.md — A10G (Ampere, 24 GB GDDR6) rewritten
- g6.md — L4 (Ada Lovelace, 24 GB GDDR6) NEW
- g6e.md — L40S (Ada Lovelace, 48 GB GDDR6) rewritten
- g7e.md — RTX PRO 6000 (Blackwell, 96 GB GDDR7) NEW

NVIDIA P-Family (NVSwitch, GPUDirect RDMA):
- p4d.md — A100 40 GB (Ampere) NEW
- p4de.md — A100 80 GB (Ampere) rewritten
- p5.md — H100 (Hopper, 80 GB HBM3) rewritten
- p5e.md — H200 (Hopper, 141 GB HBM3e) NEW
- p5en.md — H200 (Hopper, 141 GB HBM3e, EFAv3) rewritten
- p6-b200.md — B200 (Blackwell, 192 GB HBM3e) NEW
- p6-b300.md — B300 (Blackwell Ultra, 288 GB HBM3e) NEW
- p6e-gb200.md — GB200 Grace Blackwell (72-GPU NVLink domain) NEW

AWS Custom Silicon:
- trn1.md — Trainium v1 (trn1/trn1n) rewritten
- trn2.md — Trainium v2 (trn2/trn2u UltraServer) NEW
- inf2.md — Inferentia v2 NEW

What each profile now includes
What was removed and why
- docs/instance-compatibility.md — central compatibility matrix that would go stale

Test plan
instance-compatibility.md