
Adding OpenEnv Wordle GRPO sample #1063

Open
allela-roy wants to merge 1 commit into awslabs:main from allela-roy:main

Conversation

@allela-roy
Contributor

@allela-roy allela-roy commented Apr 14, 2026

Purpose

Adding a sample for OpenEnv Wordle GRPO Training on SageMaker HyperPod (EKS)

Changes

New sample

Test Plan

Environment:

  • AWS Service: SageMaker HyperPod (EKS)
  • Instance type: ml.g6e.12xlarge (4x NVIDIA L40S, 48 GB each)
  • Number of nodes: 2

Test commands:

# 1. Set up environment variables
source env_vars

# 2. Build and push container image
aws ecr get-login-password --region $AWS_REGION | docker login --username AWS --password-stdin $REGISTRY
docker build -t ${REGISTRY}${IMAGE}:${TAG} .
docker push ${REGISTRY}${IMAGE}:${TAG}

# 3. Deploy Wordle environment
envsubst < kubernetes/openenv-wordle-env.yaml | kubectl apply -f -
kubectl wait --for=condition=Available deployment/openenv-wordle --timeout=300s

# 4. Verify Wordle env health
kubectl run test-env --rm -it --restart=Never --image=curlimages/curl -- \
  curl -s http://openenv-wordle:7860/health

# 5. Launch single-GPU colocate training
envsubst < kubernetes/train-grpo-wordle.yaml | kubectl apply -f -
kubectl wait --for=condition=Ready pod/grpo-wordle --timeout=300s

Test Results

Single-node colocate mode verified with 6+ training steps, 0 pod restarts, all 4 reward signals active:

Step 1 (epoch 0.011): loss=1.696e-32, reward=0.9385, reward_correct=0.2688, greens=0.1875, yellows=0.175, repetition=0.3073, step_time=152.9s
Step 2 (epoch 0.022): loss=3.387e-32, reward=0.9281, reward_correct=0.25,   greens=0.1531, yellows=0.2125, repetition=0.3125, step_time=153.0s
Step 3 (epoch 0.032): loss=6.355e-33, reward=0.887,  reward_correct=0.2641, greens=0.1906, yellows=0.1562, repetition=0.276,  step_time=154.4s
Step 4 (epoch 0.043): loss=-2.26e-32, reward=0.9156, reward_correct=0.2594, greens=0.1625, yellows=0.1969, repetition=0.2969, step_time=152.4s
Step 5 (epoch 0.054): loss=1.503e-32, reward=0.9208, reward_correct=0.2688, greens=0.1719, yellows=0.2094, repetition=0.2708, step_time=157.3s
Step 6 (epoch 0.065): loss=-6.63e-32, reward=1.013,  reward_correct=0.3063, greens=0.2062, yellows=0.2063, repetition=0.2943, step_time=151.1s

Pod status throughout training:

NAME          READY   STATUS    RESTARTS   AGE
grpo-wordle   1/1     Running   0          25m

Directory Structure

3.test_cases/
└── <framework>/                # e.g. pytorch, megatron, jax
    └── <library>/              # e.g. picotron, FSDP, megatron-lm
        └── <model>/            # e.g. SmolLM-1.7B (may be omitted for single-model cases)
            ├── Dockerfile      # Container / environment setup
            ├── README.md       # Overview, prerequisites, usage
            ├── slurm/          # Slurm-specific launch scripts
            ├── kubernetes/     # Kubernetes manifests
            └── hyperpod-eks/   # HyperPod EKS instructions
  • Top-level files (Dockerfile, README.md, training scripts, configs) cover general setup.
  • Subdirectories (slurm/, kubernetes/, hyperpod-eks/) contain service-specific launch instructions.
  • Not all service subdirectories are required — include only the ones relevant to your test case.

Checklist

  • [x] I have read the contributing guidelines.
  • [x] I am working against the latest main branch.
  • [x] I have searched existing open and recently merged PRs to confirm this is not a duplicate.
  • [x] The contribution is self-contained with documentation and scripts.
  • [x] External dependencies are pinned to a specific version or tag (no latest).
  • [x] A README is included or updated with prerequisites, instructions, and known issues.
  • [x] New test cases follow the expected directory structure.

Collaborator

@KeitaW KeitaW left a comment


Review 1/5 — Structure & Repository Hygiene

Thanks for this contribution — the OpenEnv Wordle GRPO sample is a great addition to the TRL test cases. I have some findings across several categories; posting them as themed batches for easier navigation.

This batch covers image pinning and Dockerfile robustness.

@@ -0,0 +1,60 @@
FROM public.ecr.aws/hpc-cloud/nccl-tests:latest
Collaborator


Base image uses :latest tag

Per repo conventions, container image tags must use fixed versions, never latest. This makes builds non-reproducible — a new push to nccl-tests:latest could silently break the build. I'd suggest pinning to a specific digest or tag. You can find available tags with:

aws ecr-public describe-image-tags --repository-name hpc-cloud/nccl-tests --region us-east-1

Reference: review-checklist, "Fixed version tags."
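
For illustration, digest pinning in the Dockerfile would look like the following (the digest is a placeholder; substitute the real one reported by the registry):

```dockerfile
# Pin the base image by immutable digest so rebuilds are reproducible
# even if the tag is moved later. <digest> is a placeholder.
FROM public.ecr.aws/hpc-cloud/nccl-tests@sha256:<digest>
```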

Contributor Author


@KeitaW, don't we want to use latest, since it contains the recommended versions of the EFA and NCCL libs?

Contributor


@allela-roy it doesn't, does it? As of today (April 16th 2026), these are the latest versions on that docker image:

CUDA 12.8
EFA Installer 1.43.0
aws-ofi-nccl 1.16.3
libfabric 1.27.0

So either pin a specific tag and update the libraries in your Dockerfile yourself, or rely on latest and rebuild everything, even if that means downgrading a library.

Either way, this is a best practice, not a blocker. Choose your own approach, but make it reproducible so that it does not break in the near future.

- Schedulable
containers:
- name: wordle-env
image: registry.hf.space/burtenshaw-wordle:latest
Collaborator


Environment image uses :latest

registry.hf.space/burtenshaw-wordle:latest — if there's no versioned tag for this HuggingFace Space image, I'd suggest adding a comment noting that this is externally managed and may change without notice.

preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
preference:
matchExpressions:
Collaborator


Init container image uses :latest

I'd suggest pinning to a specific version:

Suggested change
matchExpressions:
image: curlimages/curl:8.12.1

export ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
export REGISTRY=${ACCOUNT}.dkr.ecr.${AWS_REGION}.amazonaws.com/
export IMAGE=openenv-wordle-grpo
export TAG=latest
Collaborator


Default TAG should not be latest

Combined with the :latest base image, this creates a double-unpinned build. I'd suggest defaulting to a version-based tag:

Suggested change
export TAG=latest
export TAG=v1.0

Comment on lines +37 to +47
src = p.read_text(); \
old = 'from transformers.utils.import_utils import _is_package_available'; \
new = 'from transformers.utils.import_utils import _is_package_available as _orig_is_pkg_avail\n\ndef _is_package_available(pkg_name, return_version=False):\n result = _orig_is_pkg_avail(pkg_name, return_version=return_version)\n if return_version:\n return result if isinstance(result, tuple) else (result, None)\n return result[0] if isinstance(result, tuple) else result'; \
p.write_text(src.replace(old, new))"

# Fix GRPOTrainer compatibility with transformers 5.x (warnings_issued attr)
# and vLLM 0.8.x (logprobs_mode kwarg not supported).
RUN python -c "\
import pathlib; \
p = pathlib.Path('/opt/miniconda3/lib/python3.12/site-packages/trl/trainer/grpo_trainer.py'); \
src = p.read_text(); \
Collaborator

@KeitaW KeitaW Apr 15, 2026


Inline TRL patches are fragile across versions

These python -c blocks patch TRL source files at exact paths using string replacement. If TRL updates these lines (even whitespace changes), the patches will silently fail to match and the str.replace() will be a no-op, leaving the bugs in place.

I'd suggest:

  1. Adding verification after each patch: python -c "assert 'patched_text' in open('/path/to/file').read()"
  2. Adding a comment noting which TRL issue/PR tracks the upstream fix so future maintainers know when the patches can be removed

or fix TRL directly upstream (nice PR opportunity).
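
As a sketch of option 1, the patch step could be wrapped in a small helper that fails the image build loudly when the target text is missing, instead of silently no-op'ing (paths and names here are illustrative, not the PR's actual code):

```python
import pathlib

def patch_file(path, old, new):
    """Replace `old` with `new` in the file at `path`.

    Raises RuntimeError if the expected text is absent, so that a TRL
    release that changed these lines fails the Docker build instead of
    leaving the bug in place.
    """
    p = pathlib.Path(path)
    src = p.read_text()
    if old not in src:
        raise RuntimeError(f"patch target not found in {path}; "
                           "TRL source may have changed")
    p.write_text(src.replace(old, new))
```

Invoked from a `RUN python ...` step, a mismatched TRL version then surfaces at build time rather than at training time.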

Collaborator

@KeitaW KeitaW left a comment


Nice PR, a few nits.

Collaborator

@KeitaW KeitaW left a comment


Review 2/5 — Deployment Pipeline

This batch covers the multi-GPU manifest GPU allocation issue, envsubst usage, missing copyright header, and credential handling.

Comment on lines +93 to +95
- --host
- "0.0.0.0"
- --port
Collaborator


Trainer container requests 1 GPU but claims to use GPUs 1-3

The manifest header (lines 6-8) describes "GPU 0: vLLM server, GPU 1-3: FSDP training" and line 121 uses accelerate launch --num_processes 1. But Kubernetes GPU device plugin allocates GPUs at the container level — requesting 1 GPU here means the trainer gets exactly 1 GPU, not 3.

To achieve the described 1+3 split, I'd suggest:

Suggested change
- --host
- "0.0.0.0"
- --port
requests:
nvidia.com/gpu: 3
memory: "60Gi"

And updating accelerate launch --num_processes 1 on line 121 to --num_processes 3.

Alternatively, if the intent is really 1+1 (total 2 GPUs), the README section "4. Multi-GPU Training" and the manifest header comments should be updated to match.

This is the most significant functional issue I noticed.

Comment on lines +36 to +37
sagemaker.amazonaws.com/job-max-retry-count: "3"
spec:
Collaborator


$HF_TOKEN exposed as plain-text env var

After envsubst, this embeds the actual token in the manifest YAML. Anyone with kubectl get pod -o yaml access can read it. I'd suggest using a Kubernetes Secret instead:

Suggested change
sagemaker.amazonaws.com/job-max-retry-count: "3"
spec:
- name: HF_TOKEN
valueFrom:
secretKeyRef:
name: hf-token
key: token

This applies to all manifests in this PR (train-grpo-wordle.yaml, train-grpo-wordle-multigpu.yaml, inference-wordle.yaml).

@@ -0,0 +1,60 @@
FROM public.ecr.aws/hpc-cloud/nccl-tests:latest
Collaborator


Missing copyright header

The Dockerfile is missing the standard repo copyright header. All other files in this PR include it. I'd suggest adding at the top:

# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: MIT-0

(The training script uses Apache 2.0 from the original HuggingFace source, which is fine for copied code — but the Dockerfile is original.)

Collaborator

@KeitaW KeitaW left a comment


Review 3/5 — Infrastructure & NCCL Configuration

Shared memory mounting pattern.

Comment on lines +73 to +75
memory: "180Gi"
env:
- name: PYTORCH_CUDA_ALLOC_CONF
Collaborator


/dev/shm mounted as hostPath instead of emptyDir with Memory medium

Using hostPath: /dev/shm shares the host's shared memory with the pod, which means multiple pods on the same node can interfere with each other. The preferred Kubernetes pattern is an isolated emptyDir:

Suggested change
memory: "180Gi"
env:
- name: PYTORCH_CUDA_ALLOC_CONF
- name: shmem
emptyDir:
medium: Memory
sizeLimit: 16Gi

This gives the pod its own isolated shared memory allocation. The same applies to the other manifests in this PR (train-grpo-wordle-multigpu.yaml, inference-wordle.yaml).
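
For reference, a fuller sketch of the pattern, showing the volume together with its matching volumeMount (container name is illustrative; both sections sit under the pod spec):

```yaml
# Isolated /dev/shm: RAM-backed emptyDir instead of the host's shared memory
volumes:
  - name: shmem
    emptyDir:
      medium: Memory
      sizeLimit: 16Gi
containers:
  - name: trainer          # illustrative container name
    volumeMounts:
      - name: shmem
        mountPath: /dev/shm
```

Note that the `sizeLimit` counts against the container's memory limit, so it should be sized alongside the existing memory requests.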

Collaborator

@KeitaW KeitaW left a comment


Review 4/5 — Documentation Consistency & Training Code Quality

Dangling references, missing newline, and dependency pinning.

# Check checkpoints
kubectl exec fsx-storage-manager -- ls -la /fsx/checkpoints/wordle-grpo/
```

Collaborator


README references fsx-storage-manager pod that isn't created by the PR

kubectl exec fsx-storage-manager -- ls -la /fsx/checkpoints/wordle-grpo/

There's no manifest or instructions for creating an fsx-storage-manager pod. A user following the README would get Error from server (NotFound). I'd suggest replacing with a command that uses the training pod itself:

Suggested change
kubectl exec grpo-wordle -- ls -la /fsx/checkpoints/wordle-grpo/


## YOUR GOAL

Solve the Wordle in as few guesses as possible by strategically using feedback to eliminate impossible words and narrow down the solution space efficiently. No newline at end of file
Collaborator


Missing newline at end of file

Per the repo's .editorconfig (insert_final_newline = true), all files should end with a newline character.

Comment on lines +7 to +11
accelerate>=0.27.0
peft>=0.9.0
deepspeed>=0.14.0
wandb
trl[vllm]
Collaborator


Dependencies use open-ended lower bounds without upper pins

Several dependencies use >= without upper bounds, which can pull incompatible major versions over time. I'd suggest adding upper bounds:

Suggested change
accelerate>=0.27.0
peft>=0.9.0
deepspeed>=0.14.0
wandb
trl[vllm]
transformers>=4.48.0,<5.0
datasets>=2.17.0,<3.0
accelerate>=0.27.0,<2.0
peft>=0.9.0,<1.0
deepspeed>=0.14.0,<1.0

Also note: trl[vllm] is installed here (pulling in a vllm version), then TRL is upgraded to 0.26.2 with --no-deps in the Dockerfile. The vllm version is therefore determined by whatever trl[vllm] pulled initially, not by TRL 0.26.2's constraints. This may be intentional but is worth documenting.
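
One way to catch this kind of resolution drift early is a small version assertion run at the end of the image build (package names and bounds below are illustrative, not the PR's actual constraints):

```python
# Fail the image build if a resolved package version falls outside the
# range the inline patches were written for. Bounds are illustrative.
import importlib.metadata as md

def check(pkg, lo, hi):
    """Assert that (major, minor) of the installed `pkg` is in [lo, hi)."""
    ver = tuple(int(x) for x in md.version(pkg).split(".")[:2])
    assert lo <= ver < hi, f"{pkg} {md.version(pkg)} outside [{lo}, {hi})"
```

Run via `RUN python check_versions.py` (a hypothetical script name) as a final Dockerfile step, this turns a silent vllm/trl mismatch into a build failure.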


# Flash Attention (requires torch, must install separately with --no-build-isolation)
RUN pip install flash-attn>=2.5.0 --no-build-isolation

Collaborator


flash-attn has no upper bound pin

Flash Attention builds from source and is sensitive to CUDA/PyTorch version combinations. An unpinned upper bound could pull a version incompatible with torch 2.6.0. I'd suggest:

Suggested change
RUN pip install flash-attn>=2.5.0,<3.0 --no-build-isolation

Collaborator

@KeitaW KeitaW left a comment


Review 5/5 — Things That Look Great

  • Excellent README structure. The pipeline architecture diagram, prerequisite list, step-by-step instructions with expected outputs, configuration reference table, and reward function documentation are all well-done. This is one of the more polished READMEs I've seen in a test case contribution.
  • OpenEnv is a compelling choice for demonstrating agentic RL. The client-server architecture (environment as a Kubernetes service, training as a separate pod) showcases real Kubernetes-native ML workflows. The "What is OpenEnv?" section does a great job explaining why this matters.
  • Two deployment modes (single-GPU colocate and multi-GPU server) give users flexibility. Colocate is great for quick iteration; server mode demonstrates a production-like split.
  • HyperPod-aware manifests. The tolerations, node selectors, health-check affinity, and auto-resume annotations follow HyperPod best practices consistently across all manifests.
  • The training script is well-structured. Clean separation between rollout logic (rollout_once), reward functions, and the main entrypoint. The argparse interface makes it easy to experiment.
  • Four distinct reward signals (correctness, greens, yellows, repetition) provide rich training signal and make the training dynamics interpretable from logs.
  • Copyright headers present on all shell scripts, YAML manifests, and the env_vars example.
  • The Wordle environment as a Deployment with readiness/liveness probes ensures the environment is healthy before training starts — much better than a bare pod.

Cross-cutting note: envsubst variable whitelist

The README instructs users to run envsubst < kubernetes/*.yaml | kubectl apply -f - without an explicit variable whitelist. Without a whitelist, envsubst substitutes every $VAR in the YAML. I'd suggest using explicit whitelists in the README commands, e.g.:

envsubst '$NAMESPACE $REGISTRY $IMAGE $TAG $HF_TOKEN $MODEL_NAME $VLLM_MODE $NUM_GPU_PER_NODE $NUM_GENERATIONS $GRADIENT_ACCUMULATION_STEPS $LEARNING_RATE $INSTANCE_TYPE $FSX_PVC' < kubernetes/train-grpo-wordle.yaml | kubectl apply -f -

Contributor

@paragao paragao left a comment


Thanks for the contribution — the README quality and overall structure are solid. A few items to address before we can merge:

1. HF_TOKEN plain-text exposure (Security — Blocking)

All Kubernetes manifests (train-grpo-wordle.yaml, train-grpo-wordle-multigpu.yaml, inference-wordle.yaml) pass HF_TOKEN as a plain-text value: field after envsubst processing. Anyone with kubectl get pod -o yaml access can read the token.

Please replace the plain-text value with a Kubernetes Secret reference, for example:

- name: HF_TOKEN
  valueFrom:
    secretKeyRef:
      name: hf-token
      key: token

And add instructions to the README for creating the secret:

kubectl create secret generic hf-token --from-literal=token=$HF_TOKEN

2. Multi-GPU manifest: GPU allocation mismatch (Blocking)

The comments in train-grpo-wordle-multigpu.yaml describe:

GPU 0: Dedicated vLLM server
GPU 1-3: FSDP-sharded GRPO training

But both the vllm-server and trainer containers request only 1 GPU each (nvidia.com/gpu: 1), and the training command uses accelerate launch --num_processes 1.

This means only 2 GPUs are used total, not 4 as described. Could you clarify the intended configuration?

  • If the intent is 1 GPU for vLLM + 3 GPUs for FSDP: the trainer container should request nvidia.com/gpu: 3 and use --num_processes 3.
  • If the intent is 1 GPU for vLLM + 1 GPU for training: the comments and documentation should be updated to match.

3. Add upper bounds to dependency versions in requirements.txt

The current dependencies use open-ended >= bounds (e.g., transformers>=4.48.0). Given that the Dockerfile already patches TRL for transformers 5.x compatibility issues, unbounded dependencies are a real reproducibility risk.

Please add upper bounds, for example:

transformers>=4.48.0,<6.0
datasets>=2.17.0,<4.0
accelerate>=0.27.0,<2.0
peft>=0.9.0,<1.0
deepspeed>=0.14.0,<1.0

4. Missing copyright header in Dockerfile

The Dockerfile is the only file in this PR missing the standard copyright header. Please add the following two lines at the very top of the file (before the FROM line):

# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: MIT-0

All other files in the PR already have this header.

5. Double-check env_vars.example for sensitive data

Please review the env_vars.example file and confirm it does not contain any real sensitive data such as AWS account IDs, FSx filesystem IDs, credentials, or tokens. The current placeholders look correct (e.g., <your-huggingface-token>, <your-hyperpod-cluster-name>), but please verify before merging.

6. fsx-storage-manager pod reference in README (Non-blocking)

The README references a pod called fsx-storage-manager in two places:

Checking checkpoints:

kubectl exec fsx-storage-manager -- ls -la /fsx/checkpoints/wordle-grpo/

Cleanup section:

kubectl delete pod grpo-wordle grpo-wordle-multigpu inference-wordle fsx-storage-manager 2>/dev/null

However, no manifest in this PR creates a fsx-storage-manager pod. Users following the README will get a "pod not found" error on these commands.

Could you clarify the intent here?

  • If fsx-storage-manager is a pod users are expected to have in their cluster already, please add a note explaining that.
  • If it was included by mistake, please remove it from the README commands and use the grpo-wordle pod instead (or another existing pod that has the FSx volume mounted).
  • If there should be a manifest for it, please add one.
