
docs: comprehensive instance hardware profiles (16 families) #1021

Draft

KeitaW wants to merge 4 commits into docs/instance-compatibility from docs/improved-instance-profiles

Conversation

KeitaW (Collaborator) commented Mar 13, 2026

Summary

Builds on #1017 to make the instance profiles comprehensive and authoritative. This PR:

  • Removes the central compatibility matrix (docs/instance-compatibility.md) and all "Tested Configurations" / "Instance Compatibility" sections from 22 test case READMEs — these would not be maintained as test cases evolve
  • Rewrites all 6 existing instance profiles with complete hardware specs
  • Adds 10 new profiles covering every accelerated instance family on AWS

Instance Profiles (16 total)

NVIDIA G-Family (PCIe, no NVSwitch):

  • g4dn.md — T4 (Turing, 16 GB GDDR6) NEW
  • g5.md — A10G (Ampere, 24 GB GDDR6) rewritten
  • g6.md — L4 (Ada Lovelace, 24 GB GDDR6) NEW
  • g6e.md — L40S (Ada Lovelace, 48 GB GDDR6) rewritten
  • g7e.md — RTX PRO 6000 (Blackwell, 96 GB GDDR7) NEW

NVIDIA P-Family (NVSwitch, GPUDirect RDMA):

  • p4d.md — A100 40 GB (Ampere) NEW
  • p4de.md — A100 80 GB (Ampere) rewritten
  • p5.md — H100 (Hopper, 80 GB HBM3) rewritten
  • p5e.md — H200 (Hopper, 141 GB HBM3e) NEW
  • p5en.md — H200 (Hopper, 141 GB HBM3e, EFAv3) rewritten
  • p6-b200.md — B200 (Blackwell, 192 GB HBM3e) NEW
  • p6-b300.md — B300 (Blackwell Ultra, 288 GB HBM3e) NEW
  • p6e-gb200.md — GB200 Grace Blackwell (72-GPU NVLink domain) NEW

AWS Custom Silicon:

  • trn1.md — Trainium v1 (trn1/trn1n) rewritten
  • trn2.md — Trainium v2 (trn2/trn2u UltraServer) NEW
  • inf2.md — Inferentia v2 NEW

What each profile now includes

  • GPU compute TFLOPS (BF16, FP8, TF32) and memory bandwidth
  • NVLink/NeuronLink generation and per-GPU bandwidth
  • EFA generation (v1→v4) and adapter count
  • GPUDirect RDMA support
  • All available instance sizes in the family
  • Distributed training considerations and model sizing guidance
  • Cross-references to the EFA cheatsheet (no NCCL settings duplication)

What was removed and why

  • docs/instance-compatibility.md — central compatibility matrix that would go stale
  • 22 README sections — "Tested Configurations" and "Instance Compatibility" links that reference the removed matrix
  • The veRL OOM debugging knowledge from the removed sections is preserved in the g5 profile's training considerations

Test plan

KeitaW added 2 commits March 13, 2026 00:42
Remove docs/instance-compatibility.md (central compatibility matrix) and
all "Tested Configurations" / "Instance Compatibility" sections from 22
test case READMEs. These tables would not be maintained as test cases
evolve, and stale compatibility data is worse than no data.

The veRL OOM debugging knowledge is preserved in the instance profiles.

Rewrite all 6 existing profiles and add 10 new ones covering every
accelerated instance family relevant to distributed training on AWS:

New profiles: g4dn (T4), g6 (L4), g7e (RTX PRO 6000), p4d (A100 40GB),
p5e (H200), p6-b200 (B200), p6-b300 (B300), p6e-gb200 (GB200 Grace),
trn2/trn2u (Trainium v2), inf2 (Inferentia v2)

Each profile now includes:
- GPU compute TFLOPS (BF16, FP8, TF32) and memory bandwidth
- NVLink/NeuronLink generation and per-GPU bandwidth
- EFA generation (v1-v4) and adapter count
- GPUDirect RDMA support
- All available instance sizes
- Distributed training considerations and model sizing guidance
- Cross-reference to EFA cheatsheet instead of duplicating NCCL settings

Updated README index with summary comparison tables organized by
GPU family, use case, and RDMA support.
KeitaW marked this pull request as draft March 13, 2026 00:45
KeitaW added 2 commits March 13, 2026 02:22
Corrections across all 15 instance profile files:
- NVIDIA TFLOPS: use dense (not sparse) as primary values throughout
- p6-b200: VRAM 192→179 GB per AWS docs
- p6-b300: BF16 2,500→2,250, FP8 5,000→4,500, FP4 10,000→13,500 (dense)
- p6e-gb200: BF16 2,250→2,500 (GB200 variant higher than standard B200)
- inf2: EFA support corrected to None (Inf2 does not support EFA)
- trn1: memory bandwidth 613→820 GB/s, NVMe sizes corrected
- trn2: separated trn2u instance vs UltraServer in table, BF16 632→667
- g6: BF16 242→121, g7e: memory BW 1,792→1,597 GB/s
- All profiles: "bisection bandwidth" → "aggregate bandwidth" for NVLink
- README index updated with all corrected values

Add three high-level summary tables to the instance profiles README:
- GPU/Accelerator Quick Reference with BF16, FP8, FP4, TF32/FP32 TFLOPS
- NVIDIA Instance Quick Reference with GPU memory, NVLink, and EFA specs
- Trainium/Inferentia Instance Quick Reference with chip and interconnect details

Also update p5.md to include p5.4xlarge in the Covers line.
paragao (Contributor) left a comment


Review

Great work on this PR. The 16 instance profiles are well-structured, consistent in format, and provide actionable guidance — especially the NCCL/EFA configuration sections and distributed training considerations. Consolidating from scattered per-README tables into centralized profiles is the right call for long-term maintainability.

A few suggestions to tighten up the data consistency before merging:

1. Trainium v1 Memory Bandwidth — inconsistent across files

  • docs/instance-profiles/trn1.md (Hardware at a Glance) and docs/instance-profiles/README.md (Quick Reference) both list 820 GB/s per chip.
  • docs/instance-profiles/trn2.md (comparison table) lists 613 GB/s for trn1.

Could you cross-check against the official AWS documentation and update all three files to use the same verified value? The 4.7x multiplier in the trn2 comparison table is derived from 613, so whichever value is correct, the multiplier will need to be updated accordingly.
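
For quick reference, the arithmetic (a sketch in Python; the trn2 per-chip bandwidth of ~2,880 GB/s is backed out of the table's own 613 × 4.7, not an independently verified spec):

```python
# Recompute the trn1 -> trn2 memory-bandwidth multiplier for both candidate
# trn1 values. The trn2 figure is inferred from the existing comparison
# table (613 GB/s x 4.7), so treat it as an assumption until verified.
trn2_bw = 613 * 4.7  # ~2,881 GB/s implied by the current table

for trn1_bw in (613, 820):
    print(f"trn1 at {trn1_bw} GB/s -> trn2 multiplier ~ {trn2_bw / trn1_bw:.1f}x")

# trn1 at 613 GB/s -> trn2 multiplier ~ 4.7x  (current comparison table)
# trn1 at 820 GB/s -> trn2 multiplier ~ 3.5x  (value in trn1.md and the README)
```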

2. Trainium v2 BF16 TFLOPS — inconsistent within trn2.md

  • docs/instance-profiles/trn2.md Hardware at a Glance and docs/instance-profiles/README.md Quick Reference both say ~667 TFLOPS per chip.
  • The comparison table in the same trn2.md file says ~632 TFLOPS with a 3.3x multiplier.
  • The Key Characteristics section claims "3.5x the BF16 compute" which aligns with 667 but not 632.

Worth verifying the correct figure against AWS documentation and making the Hardware at a Glance table, comparison table, and Key Characteristics text all consistent.
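
Same sanity check here (the trn1 BF16 base of ~191 TFLOPS is inferred from the table's own 632 / 3.3, not a verified spec):

```python
# Back out the trn1 BF16 base implied by the comparison table, then see
# which trn2 figure matches the "3.5x" Key Characteristics claim.
trn1_bf16 = 632 / 3.3  # ~191.5 TFLOPS implied by the table (assumption)

for trn2_bf16 in (632, 667):
    print(f"trn2 at {trn2_bf16} TFLOPS -> {trn2_bf16 / trn1_bf16:.1f}x trn1")

# trn2 at 632 TFLOPS -> 3.3x trn1  (comparison table)
# trn2 at 667 TFLOPS -> 3.5x trn1  (matches Hardware at a Glance and the 3.5x claim)
```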

3. p4de comparison table — apples-to-oranges TFLOPS comparison

In docs/instance-profiles/p4de.md, the comparison with p5en shows:

| BF16 TFLOPS/GPU | 312 | 1,979 |

The A100 value (312) is dense BF16, while the H200 value (1,979) is sparse BF16; the H200 dense BF16 figure is 990 TFLOPS. Suggest comparing like for like — either both dense (312 vs 990) or labeling the values explicitly as "(dense)" / "(sparse)" — so readers aren't misled by the apparent ~6x gap.
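
To make the distortion concrete (values are from the row above; NVIDIA's 2:4 structured-sparsity figures are 2× the dense ones):

```python
# Ratio a reader infers from the current mixed row vs. the like-for-like ratio.
a100_bf16_dense = 312    # p4de table, dense
h200_bf16_sparse = 1979  # p4de table, sparse (2x dense)
h200_bf16_dense = 990    # ~ h200_bf16_sparse / 2

print(f"mixed dense-vs-sparse: {h200_bf16_sparse / a100_bf16_dense:.1f}x")  # ~6.3x, misleading
print(f"dense-vs-dense:        {h200_bf16_dense / a100_bf16_dense:.1f}x")   # ~3.2x, actual gap
```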

4. veRL README — add a cross-reference to the g5 profile

The veRL RLVR READMEs (3.test_cases/pytorch/verl/kubernetes/rlvr/README.md and 3.test_cases/pytorch/verl/hyperpod-eks/rlvr/README.md) previously had a detailed g5-specific OOM parameter table. The general lesson is preserved in g5.md, but the workload-specific parameter guidance is lost. A one-line note such as:

For g5 instance memory optimization guidance, see the g5 instance profile.

would help users running veRL on g5 find the relevant information.


Overall this is a strong documentation improvement. The suggestions above are non-blocking but would improve accuracy and completeness. Nice work.

