
Updating hyperpod-elastic-agent (HPEA) to v1.1.2 to support torch v2.6+ #1022

Open
aravneelaws wants to merge 13 commits into main from aravneel-test-hpea

Conversation

@aravneelaws
Contributor

Purpose

  • Pins hyperpod-elastic-agent to v1.1.2 in the FSDP Dockerfile, which resolves compatibility issues with PyTorch 2.6+ (tested and confirmed working on PyTorch 2.7.1)
  • Adds a new HyperPodPyTorchJob (HPTO) manifest for Llama 3.2 1B as a lightweight training example suitable for smaller GPU instances (e.g., g5.8xlarge)

Resolves #944

Changes

Dockerfile: Pin HPEA v1.1.2

The hpto stage of 3.test_cases/pytorch/FSDP/Dockerfile is updated from:

```dockerfile
RUN pip install hyperpod-elastic-agent
```

to:

```dockerfile
RUN pip install hyperpod-elastic-agent==1.1.2
```

HPEA v1.1.0 (the previously available release) did not support PyTorch 2.6+, blocking adoption of more recent PyTorch versions — particularly on newer hardware such as Blackwell where PyTorch 2.9+ and CUDA 12.8 are required. v1.1.2 restores compatibility with PyTorch 2.6+. Pinning the version ensures reproducible builds.

New manifest: Llama 3.2 1B HPTO

3.test_cases/pytorch/FSDP/kubernetes/llama3_2_1b-fsdp-hpto.yaml is a new HyperPodPyTorchJob manifest for Llama 3.2 1B FSDP training. It complements the existing 8B manifest and is parameterized with envsubst variables ($IMAGE_URI, $NUM_NODES, $GPU_PER_NODE, $EFA_PER_NODE, $HF_TOKEN) for portability across cluster configurations.

Test Plan and Results

Validated end-to-end on a SageMaker HyperPod EKS cluster (8x ml.g5.8xlarge, 1 GPU + 1 EFA per node):

  • HPEA v1.1.2 installed cleanly from production PyPI with no workarounds
  • Llama 3.2 1B FSDP training ran 100 steps successfully (~2.19 samples/sec)
  • Checkpointing verified at steps 50 and 100
  • No NCCL or EFA errors during training

Checklist

  • I have read the contributing guidelines.
  • I am working against the latest main branch.
  • I have searched existing open and recently merged PRs to confirm this is not a duplicate.
  • The contribution is self-contained with documentation and scripts.
  • External dependencies are pinned to a specific version or tag (no latest).
  • A README is included or updated with prerequisites, instructions, and known issues.
  • New test cases follow the expected directory structure.

@aravneelaws aravneelaws requested a review from KeitaW March 13, 2026 17:03
Collaborator

@KeitaW left a comment


Review Batch 1/2 — Structure & Deployment Pipeline

Thanks for this contribution, Aravind! The HPEA version pin is a clean, well-motivated fix and the new 1B manifest is a nice lightweight addition. I have a few suggestions below.

Comment thread 3.test_cases/pytorch/FSDP/kubernetes/llama3_2_1b-fsdp-hpto.yaml
Collaborator

@KeitaW left a comment


Review Batch 2/2 — Infrastructure & Documentation

Kubernetes README not updated with the new manifest

File: 3.test_cases/pytorch/FSDP/kubernetes/README.md (not modified in this PR)

The PR description mentions the contribution is "self-contained with documentation," but the kubernetes README isn't updated to reference the new Llama 3.2 1B HPTO manifest. Users browsing the README won't discover this new option. I'd suggest adding a section or entry documenting:

  • What the manifest is for (lightweight Llama 3.2 1B training on smaller instances)
  • Prerequisites (HyperPod EKS cluster, HPEA v1.1.2 image)
  • The envsubst variables needed ($IMAGE_URI, $NUM_NODES, $GPU_PER_NODE, $EFA_PER_NODE, $HF_TOKEN)

Things That Look Great

  • Version pinning on HPEA: Changing from unpinned pip install hyperpod-elastic-agent to ==1.1.2 is exactly the right call — this ensures reproducible builds and matches the repo convention of pinning all external dependencies.
  • Well-parameterized manifest: The use of envsubst variables ($IMAGE_URI, $NUM_NODES, $GPU_PER_NODE, $EFA_PER_NODE, $HF_TOKEN) makes the manifest portable across different cluster configurations.
  • Correct model architecture parameters: The Llama 3.2 1B hyperparameters (hidden_width=2048, num_layers=16, num_heads=32, num_key_value_heads=2, intermediate_size=8192) are accurate and use model_type=llama_v3 correctly.
  • Sensible training defaults for a lightweight example: max_steps=100 with checkpoint_freq=50 is a practical smoke test that still validates checkpointing works.
  • Comprehensive NCCL debugging env vars: The manifest includes the full set of NCCL trace/monitoring variables consistent with the existing manifests, which is valuable for troubleshooting on EFA clusters.
  • Thorough test plan in the PR description: End-to-end validation on actual hardware with specific metrics (2.19 samples/sec, checkpoint verification) gives confidence the changes work as intended.

Comment thread 3.test_cases/pytorch/FSDP/kubernetes/llama3_2_1b-fsdp-hpto.yaml Outdated
Collaborator

@KeitaW left a comment


Thanks for the update! Few comments

@aravneelaws
Contributor Author

> Thanks for the update! Few comments

Thanks, Keita. Will look into this in the next couple of days and have another commit fixing the suggestions. Thanks for the thorough review.

@aravneelaws
Contributor Author

Hi @KeitaW - I have incorporated your suggestions and left comments on some. The changes are pushed. The details are summarized below:

On moving to hyperpod-eks/ subdirectory

Thanks for the suggestion. I agree that HPTO manifests should eventually live under a dedicated hyperpod-eks/ directory. However, doing this properly in this PR would require also moving the existing 8B HPTO manifest (llama3_1_8b-fsdp-hpto.yaml), plus creating a separate README for that directory. That scope expansion doesn't fit well in this PR, which is focused on the HPEA version pin and the new 1B example. I will follow up with another PR that addresses this.

On nodeAffinity

Good catch — I've added the nodeAffinity block with the sagemaker.amazonaws.com/node-health-status: Schedulable selector, which is the important one for preventing scheduling on unhealthy nodes.

However, I've intentionally left out the sagemaker.amazonaws.com/compute-type: ${INSTANCE_TYPE} selector for now. Looking at other manifests in this repo (e.g., 4.validation_and_observability/5.nsight/EKS/llama3_2_1b-fsdp-nsight.yaml), it appears the compute-type label value is hyperpod rather than the actual instance type like ml.g5.8xlarge. If that's the case, passing ${INSTANCE_TYPE} (which users would set to something like g5.8xlarge or ml.g5.8xlarge) wouldn't match, and pods would never get scheduled. This may also be an issue in the existing 8B manifest. I want to verify the correct label value on a live cluster before adding this. I'll follow up once confirmed. This will also be addressed in the next PR.
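For reference, the added selector looks roughly like this; the label key and value are the ones named above, while the exact placement inside the HyperPodPyTorchJob pod template may differ from this standard Kubernetes sketch:

```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            # Keep pods off nodes HyperPod has marked unhealthy.
            - key: sagemaker.amazonaws.com/node-health-status
              operator: In
              values:
                - Schedulable
```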

On FI_EFA_USE_DEVICE_RDMA

Done — added an inline comment explaining this is intentionally set to 0 for g5 instances and should be 1 for p4d/p5.
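The resulting container env entry is along these lines (a sketch of a standard pod-spec env var; the value and comment follow the guidance above):

```yaml
env:
  # Device RDMA is not supported on g5 instances; set to "1" on p4d/p5.
  - name: FI_EFA_USE_DEVICE_RDMA
    value: "0"
```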

On README update

Done — added a new section 4a to the README covering the HPTO manifest, including prerequisites (HyperPod EKS cluster, hpto Docker build target with HPEA v1.1.2), environment variables, launch/monitor/stop instructions, and a note about the RDMA setting. Also updated the intro paragraph to mention the HPTO manifests with a link to the new section. With the new PR, we will move this into a separate README.md in the hyperpod-eks directory.



Development

Successfully merging this pull request may close these issues.

Broken training llama with PyTorch FSDP example on SMHP
