Updating hyperpod-elastic-agent (HPEA) to v1.1.2 to support torch v2.6+ #1022
aravneelaws wants to merge 13 commits into main from …
Conversation
KeitaW
left a comment
Review Batch 1/2 — Structure & Deployment Pipeline
Thanks for this contribution, Aravind! The HPEA version pin is a clean, well-motivated fix and the new 1B manifest is a nice lightweight addition. I have a few suggestions below.
KeitaW
left a comment
Review Batch 2/2 — Infrastructure & Documentation
Kubernetes README not updated with the new manifest
File: 3.test_cases/pytorch/FSDP/kubernetes/README.md (not modified in this PR)
The PR description mentions the contribution is "self-contained with documentation," but the kubernetes README isn't updated to reference the new Llama 3.2 1B HPTO manifest. Users browsing the README won't discover this new option. I'd suggest adding a section or entry documenting:
- What the manifest is for (lightweight Llama 3.2 1B training on smaller instances)
- Prerequisites (HyperPod EKS cluster, HPEA v1.1.2 image)
- The envsubst variables needed (`$IMAGE_URI`, `$NUM_NODES`, `$GPU_PER_NODE`, `$EFA_PER_NODE`, `$HF_TOKEN`)
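For readers new to this workflow, a minimal sketch of what the substitution step does (the YAML fragment and all values below are placeholders, not the actual manifest; a shell heredoc is used here to show the same expansion `envsubst` performs on exported variables):

```shell
#!/bin/sh
# Placeholder values only -- substitute your own image URI and cluster sizes.
export IMAGE_URI="123456789012.dkr.ecr.us-west-2.amazonaws.com/fsdp:latest"
export NUM_NODES=8
export GPU_PER_NODE=1
export EFA_PER_NODE=1

# envsubst replaces ${VAR} references in the manifest with exported env vars;
# an unquoted heredoc demonstrates the identical expansion:
cat <<EOF
image: ${IMAGE_URI}
nnodes: ${NUM_NODES}
nproc_per_node: ${GPU_PER_NODE}
efa_per_node: ${EFA_PER_NODE}
EOF
```

Against the real file, the rendering step is typically `envsubst < llama3_2_1b-fsdp-hpto.yaml | kubectl apply -f -`.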
Things That Look Great
- Version pinning on HPEA: Changing from unpinned `pip install hyperpod-elastic-agent` to `==1.1.2` is exactly the right call; this ensures reproducible builds and matches the repo convention of pinning all external dependencies.
- Well-parameterized manifest: The use of `envsubst` variables (`$IMAGE_URI`, `$NUM_NODES`, `$GPU_PER_NODE`, `$EFA_PER_NODE`, `$HF_TOKEN`) makes the manifest portable across different cluster configurations.
- Correct model architecture parameters: The Llama 3.2 1B hyperparameters (`hidden_width=2048`, `num_layers=16`, `num_heads=32`, `num_key_value_heads=2`, `intermediate_size=8192`) are accurate and use `model_type=llama_v3` correctly.
- Sensible training defaults for a lightweight example: `max_steps=100` with `checkpoint_freq=50` is a practical smoke test that still validates checkpointing works.
- Comprehensive NCCL debugging env vars: The manifest includes the full set of NCCL trace/monitoring variables consistent with the existing manifests, which is valuable for troubleshooting on EFA clusters.
- Thorough test plan in the PR description: End-to-end validation on actual hardware with specific metrics (2.19 samples/sec, checkpoint verification) gives confidence the changes work as intended.
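For context on the NCCL debugging point, a hedged sketch of the kind of debug settings such manifests export. The variable names below are standard NCCL and libfabric environment variables; the manifest's exact list and values are not reproduced here:

```shell
#!/bin/sh
# Illustrative NCCL / libfabric debug settings; consult the manifest for the
# exact set it ships with.
export NCCL_DEBUG=INFO              # NCCL log verbosity (VERSION/WARN/INFO/TRACE)
export NCCL_DEBUG_SUBSYS=INIT,NET   # restrict logging to chosen subsystems
export FI_LOG_LEVEL=warn            # libfabric (EFA provider) log level

echo "NCCL_DEBUG=${NCCL_DEBUG} NCCL_DEBUG_SUBSYS=${NCCL_DEBUG_SUBSYS}"
```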
KeitaW
left a comment
Thanks for the update! A few comments:
Thanks, Keita. I'll look into this in the next couple of days and push another commit addressing the suggestions. Thanks for the thorough review.
Hi @KeitaW - I have incorporated your suggestions and left comments on some. The changes are pushed. The details are summarized below:

On moving to hyperpod-eks/ subdirectory
On nodeAffinity
On FI_EFA_USE_DEVICE_RDMA
On README update
Purpose
Updates `hyperpod-elastic-agent` to v1.1.2 in the FSDP Dockerfile, which resolves compatibility issues with PyTorch 2.6+ (tested and confirmed working on PyTorch 2.7.1).

Resolves #944
Changes
Dockerfile: Pin HPEA v1.1.2
The hpto stage of `3.test_cases/pytorch/FSDP/Dockerfile` is updated from:

```dockerfile
RUN pip install hyperpod-elastic-agent
```

to:

```dockerfile
RUN pip install hyperpod-elastic-agent==1.1.2
```

HPEA v1.1.0 (the previously available release) did not support PyTorch 2.6+, blocking adoption of more recent PyTorch versions, particularly on newer hardware such as Blackwell where PyTorch 2.9+ and CUDA 12.8 are required. v1.1.2 restores compatibility with PyTorch 2.6+. Pinning the version ensures reproducible builds.
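The effect of the `==` specifier can be sketched as follows (a pure-shell illustration of the pin format, not part of the PR): an unpinned install resolves to whatever is newest on PyPI at build time, while the exact-version specifier fixes the resolved version.

```shell
#!/bin/sh
# Illustrative parsing of a pinned pip requirement string; the "==" specifier
# is what makes rebuilds deterministic.
REQUIREMENT="hyperpod-elastic-agent==1.1.2"
echo "package: ${REQUIREMENT%%==*}"
echo "pinned version: ${REQUIREMENT##*==}"
```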
New manifest: Llama 3.2 1B HPTO
`3.test_cases/pytorch/FSDP/kubernetes/llama3_2_1b-fsdp-hpto.yaml` is a new `HyperPodPyTorchJob` manifest for Llama 3.2 1B FSDP training. It complements the existing 8B manifest and is parameterized with envsubst variables (`$IMAGE_URI`, `$NUM_NODES`, `$GPU_PER_NODE`, `$EFA_PER_NODE`, `$HF_TOKEN`) for portability across cluster configurations.

Test Plan and Results
Validated end-to-end on a SageMaker HyperPod EKS cluster (8x ml.g5.8xlarge, 1 GPU + 1 EFA per node):
Checklist
- … `main` branch.
- … (`latest`).