
Updating hyperpod-elastic-agent (HPEA) to v1.1.2 to support torch v2.6+ #1022

Open
aravneelaws wants to merge 13 commits into main from aravneel-test-hpea

Conversation

@aravneelaws
Contributor

Purpose

  • Pins hyperpod-elastic-agent to v1.1.2 in the FSDP Dockerfile, which resolves compatibility issues with PyTorch 2.6+ (tested and confirmed working on PyTorch 2.7.1)
  • Adds a new HyperPodPyTorchJob (HPTO) manifest for Llama 3.2 1B as a lightweight training example suitable for smaller GPU instances (e.g., g5.8xlarge)

Resolves #944

Changes

Dockerfile: Pin HPEA v1.1.2

The hpto stage of 3.test_cases/pytorch/FSDP/Dockerfile is updated from:

```dockerfile
RUN pip install hyperpod-elastic-agent
```

to:

```dockerfile
RUN pip install hyperpod-elastic-agent==1.1.2
```

HPEA v1.1.0 (the previously available release) did not support PyTorch 2.6+, blocking adoption of more recent PyTorch versions — particularly on newer hardware such as Blackwell where PyTorch 2.9+ and CUDA 12.8 are required. v1.1.2 restores compatibility with PyTorch 2.6+. Pinning the version ensures reproducible builds.

New manifest: Llama 3.2 1B HPTO

3.test_cases/pytorch/FSDP/kubernetes/llama3_2_1b-fsdp-hpto.yaml is a new HyperPodPyTorchJob manifest for Llama 3.2 1B FSDP training. It complements the existing 8B manifest and is parameterized with envsubst variables ($IMAGE_URI, $NUM_NODES, $GPU_PER_NODE, $EFA_PER_NODE, $HF_TOKEN) for portability across cluster configurations.

Test Plan and Results

Validated end-to-end on a SageMaker HyperPod EKS cluster (8x ml.g5.8xlarge, 1 GPU + 1 EFA per node):

  • HPEA v1.1.2 installed cleanly from production PyPI with no workarounds
  • Llama 3.2 1B FSDP training ran 100 steps successfully (~2.19 samples/sec)
  • Checkpointing verified at steps 50 and 100
  • No NCCL or EFA errors during training

Checklist

  • I have read the contributing guidelines.
  • I am working against the latest main branch.
  • I have searched existing open and recently merged PRs to confirm this is not a duplicate.
  • The contribution is self-contained with documentation and scripts.
  • External dependencies are pinned to a specific version or tag (no latest).
  • A README is included or updated with prerequisites, instructions, and known issues.
  • New test cases follow the expected directory structure.

@aravneelaws aravneelaws requested a review from KeitaW March 13, 2026 17:03
Collaborator

@KeitaW left a comment


Review Batch 1/2 — Structure & Deployment Pipeline

Thanks for this contribution, Aravind! The HPEA version pin is a clean, well-motivated fix and the new 1B manifest is a nice lightweight addition. I have a few suggestions below.

Comment thread 3.test_cases/pytorch/FSDP/kubernetes/llama3_2_1b-fsdp-hpto.yaml
Collaborator

@KeitaW left a comment


Review Batch 2/2 — Infrastructure & Documentation

Kubernetes README not updated with the new manifest

File: 3.test_cases/pytorch/FSDP/kubernetes/README.md (not modified in this PR)

The PR description mentions the contribution is "self-contained with documentation," but the kubernetes README isn't updated to reference the new Llama 3.2 1B HPTO manifest. Users browsing the README won't discover this new option. I'd suggest adding a section or entry documenting:

  • What the manifest is for (lightweight Llama 3.2 1B training on smaller instances)
  • Prerequisites (HyperPod EKS cluster, HPEA v1.1.2 image)
  • The envsubst variables needed ($IMAGE_URI, $NUM_NODES, $GPU_PER_NODE, $EFA_PER_NODE, $HF_TOKEN)

Things That Look Great

  • Version pinning on HPEA: Changing from unpinned pip install hyperpod-elastic-agent to ==1.1.2 is exactly the right call — this ensures reproducible builds and matches the repo convention of pinning all external dependencies.
  • Well-parameterized manifest: The use of envsubst variables ($IMAGE_URI, $NUM_NODES, $GPU_PER_NODE, $EFA_PER_NODE, $HF_TOKEN) makes the manifest portable across different cluster configurations.
  • Correct model architecture parameters: The Llama 3.2 1B hyperparameters (hidden_width=2048, num_layers=16, num_heads=32, num_key_value_heads=2, intermediate_size=8192) are accurate and use model_type=llama_v3 correctly.
  • Sensible training defaults for a lightweight example: max_steps=100 with checkpoint_freq=50 is a practical smoke test that still validates checkpointing works.
  • Comprehensive NCCL debugging env vars: The manifest includes the full set of NCCL trace/monitoring variables consistent with the existing manifests, which is valuable for troubleshooting on EFA clusters.
  • Thorough test plan in the PR description: End-to-end validation on actual hardware with specific metrics (2.19 samples/sec, checkpoint verification) gives confidence the changes work as intended.

Comment thread 3.test_cases/pytorch/FSDP/kubernetes/llama3_2_1b-fsdp-hpto.yaml Outdated
Collaborator

@KeitaW left a comment


Thanks for the update! Few comments

@aravneelaws
Contributor Author

> Thanks for the update! Few comments

Thanks, Keita. Will look into this in the next couple of days and have another commit fixing the suggestions. Thanks for the thorough review.

@aravneelaws
Contributor Author

Hi @KeitaW - I have incorporated your suggestions and left comments on some. The changes are pushed. The details are summarized below:

On moving to hyperpod-eks/ subdirectory

Thanks for the suggestion. I agree that HPTO manifests should eventually live under a dedicated hyperpod-eks/ directory. However, doing this properly in this PR would require also moving the existing 8B HPTO manifest (llama3_1_8b-fsdp-hpto.yaml), plus creating a separate README for that directory. That scope expansion doesn't fit well in this PR, which is focused on the HPEA version pin and the new 1B example. I will follow up with another PR that addresses this.

On nodeAffinity

Good catch — I've added the nodeAffinity block with the sagemaker.amazonaws.com/node-health-status: Schedulable selector, which is the important one for preventing scheduling on unhealthy nodes.

However, I've intentionally left out the sagemaker.amazonaws.com/compute-type: ${INSTANCE_TYPE} selector for now. Looking at other manifests in this repo (e.g., 4.validation_and_observability/5.nsight/EKS/llama3_2_1b-fsdp-nsight.yaml), it appears the compute-type label value is hyperpod rather than the actual instance type like ml.g5.8xlarge. If that's the case, passing ${INSTANCE_TYPE} (which users would set to something like g5.8xlarge or ml.g5.8xlarge) wouldn't match, and pods would never get scheduled. This may also be an issue in the existing 8B manifest. I want to verify the correct label value on a live cluster before adding this. I'll follow up once confirmed. This will also be addressed in the next PR.
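For reference, the added selector looks roughly like this; the label key and value are the ones named above, while the exact placement inside the HyperPodPyTorchJob pod template may differ from this standard Kubernetes sketch:

```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            # Keep pods off nodes HyperPod has marked unhealthy.
            - key: sagemaker.amazonaws.com/node-health-status
              operator: In
              values:
                - Schedulable
```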

On FI_EFA_USE_DEVICE_RDMA

Done — added an inline comment explaining this is intentionally set to 0 for g5 instances and should be 1 for p4d/p5.
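The resulting container env entry is along these lines (a sketch of a standard pod-spec env var; the value and comment follow the guidance above):

```yaml
env:
  # Device RDMA is not supported on g5 instances; set to "1" on p4d/p5.
  - name: FI_EFA_USE_DEVICE_RDMA
    value: "0"
```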

On README update

Done — added a new section 4a to the README covering the HPTO manifest, including prerequisites (HyperPod EKS cluster, hpto Docker build target with HPEA v1.1.2), environment variables, launch/monitor/stop instructions, and a note about the RDMA setting. Also updated the intro paragraph to mention the HPTO manifests with a link to the new section. With the new PR, we will move this into a separate README.md in the hyperpod-eks directory.



Development

Successfully merging this pull request may close these issues.

Broken training llama with PyTorch FSDP example on SMHP
