Add veRL GRPO training recipe for gpt-oss-20b on g5.12xlarge #1054

Open
nkumaraws wants to merge 3 commits into main from feature/verl-grpo-gptoss-g5-clean

Conversation

@nkumaraws
Contributor

Summary

  • Add complete veRL GRPO (Group Relative Policy Optimization) recipe for training openai/gpt-oss-20b (20B MoE) on g5.12xlarge (4x A10G 24GB GPUs) with HyperPod EKS
  • Custom multilingual language compliance reward function — model must reason and answer in the same language as the user's question
  • Full evaluation pipeline with vLLM batch inference (50-question test set, 5 languages)

Architecture

  • 6 nodes: 3 GPU workers (12 GPUs total) + 1 Ray head (CPU only) + 2 spare
  • veRL + FSDP2: Full CPU offload (offload_policy=True) for both params and optimizer on 24GB GPUs
  • vLLM inline inference: gpu_memory_utilization=0.45, enforce_eager=True (no CUDA graphs)
  • Ray orchestration via KubeRay on HyperPod EKS
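The bullets above map onto a handful of veRL Hydra overrides. A sketch of the launch shape (the authoritative flags live in recipe/run_gptoss_grpo.sh; the override paths follow veRL's standard config layout and may differ by version):

```bash
# Sketch only — real flags live in recipe/run_gptoss_grpo.sh.
ray job submit --address http://localhost:8265 -- \
  python -m verl.trainer.main_ppo \
    algorithm.adv_estimator=grpo \
    actor_rollout_ref.model.path=openai/gpt-oss-20b \
    actor_rollout_ref.actor.strategy=fsdp2 \
    actor_rollout_ref.actor.fsdp_config.offload_policy=True \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.45 \
    actor_rollout_ref.rollout.enforce_eager=True \
    trainer.nnodes=3 \
    trainer.n_gpus_per_node=4 \
    trainer.save_freq=20
```

Note nnodes=3, not 4: only the GPU workers count, since the Ray head advertises zero GPUs.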

Training Results

Validated on HyperPod EKS cluster (trl-gptoss-eks), 80 steps:

| Metric | Value |
|---|---|
| Steps completed | 80 |
| Final val reward | 5.65 / 6.0 |
| Step time | ~2.5 min |
| Total training time | ~3.5 hours |
| Checkpoint size | ~117 GB (FSDP shards) |

Evaluation (GRPO vs SFT baseline):

  • GRPO: 98% reasoning correct, 80% answer correct
  • SFT baseline: 96% reasoning correct, 74% answer correct
  • GRPO improved answer language compliance by 6 percentage points (74% → 80%)

Files

| File | Description |
|---|---|
| Dockerfile | Updated with gpt-oss-20b dependencies |
| README.md | Expanded with GRPO recipe docs, memory optimization, troubleshooting |
| recipe/run_gptoss_grpo.sh | Training launcher (ray job submit) |
| recipe/language_reward.py | Custom reward function (language detection + scoring) |
| recipe/evaluate_gptoss.py | 50-question vLLM evaluation across 5 languages |
| recipe/evaluate_gptoss.sh | Evaluation wrapper script |
| setup/env_vars.example | g5.12xlarge + p5en.48xlarge config templates |
| setup/load_data_gptoss.sh | HuggingFaceH4/Multilingual-Thinking data prep |
| setup/raycluster.yaml | Updated KubeRay manifest (num-gpus=0 head, OOM killer disabled) |

Testing

  • Trained 80 steps on 3x ml.g5.12xlarge workers (12 GPUs)
  • HF checkpoint saved and evaluated successfully
  • No OOM during training or checkpoint save
  • EFA networking validated (NCCL over OFI)

Related

  • Companion PR: OpenRLHF GRPO recipe (same task, different framework) — for framework comparison

Add a complete GRPO training recipe for the openai/gpt-oss-20b MoE model
(20B params, 32 experts) on g5.12xlarge instances with 4x A10G 24GB GPUs.
This required FSDP2 with CPU offloading — FSDP1 explicitly disables
CPUOffload for the actor role, making it impossible to train models
larger than ~10B on 24GB GPUs.
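Back-of-envelope arithmetic (my own numbers, not from the PR) shows why CPU offload is unavoidable at this scale: even fully sharded across all 12 GPUs, bf16 weights plus gradients plus fp32 Adam state slightly exceed a 24 GiB A10G before any activations or vLLM KV cache are accounted for.

```python
# Rough memory budget for a 20B-parameter model under FSDP sharding
# across 12 GPUs (approximate; assumes bf16 weights/grads and fp32
# AdamW master weights + two moments).
GB = 1024**3
n_params = 20e9

weights_bf16 = n_params * 2      # sharded model weights
grads_bf16 = n_params * 2        # gradients
adam_fp32 = n_params * 4 * 3     # master weights + exp_avg + exp_avg_sq

total = weights_bf16 + grads_bf16 + adam_fp32
per_gpu = total / 12 / GB
print(f"per-GPU training state if kept on GPU: {per_gpu:.1f} GiB")
# ~24.8 GiB — already over budget on a 24 GiB A10G, with zero room
# for activations or the co-located vLLM engine.
```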

New files:
- recipe/run_gptoss_grpo.sh: GRPO training script with FSDP2,
  offload_policy, bf16, gpu_memory_utilization=0.6, enforce_eager,
  and detailed comments explaining why each parameter is required
- recipe/language_reward.py: Custom veRL reward function for
  multilingual language compliance scoring
- setup/load_data_gptoss.sh: Data preparation for
  HuggingFaceH4/Multilingual-Thinking dataset in veRL parquet format

Modified files:
- Dockerfile: Updated base image to verlai/verl:vllm011.latest,
  added langdetect and peft dependencies, added Megatron-LM
- README.md: Added g5 instance guidance, checkpoint management
  section, and troubleshooting for 7 common failure modes
- env_vars.example: Added g5.12xlarge reference config with
  critical notes (WORKER_MEMORY=150Gi, FI_EFA_USE_DEVICE_RDMA=0)
- raycluster.yaml: Parameterized CPU/memory/EFA settings via envsubst

Key lessons encoded in the recipe (from 11 OOM iterations):
- FSDP2 required (FSDP1 keeps actor on GPU)
- offload_policy=True (FSDP2-specific CPU offload)
- model_dtype=bf16 (veRL defaults actor to fp32)
- gpu_memory_utilization is fraction of TOTAL GPU, not just KV cache
- enforce_eager=True (CUDA graphs OOM on 24GB)
- save_freq=20 (117GB/checkpoint fills 1.2TB FSx in 9 steps at freq=1)
- nnodes=3 not 4 (head pod has no GPUs, causes NCCL hang)
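The save_freq lesson checks out numerically; a quick sketch (assuming GiB-sized checkpoints on a decimal 1.2 TB volume):

```python
# Sanity-check the checkpoint budget: how many 117 GiB checkpoints
# fit on a 1.2 TB (decimal) FSx volume?
ckpt_gib = 117
fsx_gib = 1.2e12 / 1024**3          # ~1117.6 GiB usable
fits = int(fsx_gib // ckpt_gib)
print(fits)                          # 9 — at save_freq=1 the volume
                                     # fills after 9 steps
```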

Adds a post-training evaluation workflow for veRL GRPO checkpoints:
- evaluate_gptoss.py: 50-question eval (10 prompts x 5 languages) using
  vLLM TP=4 batch inference with langdetect scoring
- evaluate_gptoss.sh: end-to-end wrapper that converts FSDP shards to HF
  format via verl.model_merger, then runs eval with optional SFT baseline
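The convert-then-evaluate flow might look like the following sketch (verl.model_merger exists in recent veRL releases but its flags vary by version; the evaluate_gptoss.py arguments shown here are hypothetical):

```bash
# Merge FSDP shards into a HuggingFace-format checkpoint
# (flag names vary across veRL versions — check your release).
python -m verl.model_merger merge \
  --backend fsdp \
  --local_dir checkpoints/global_step_80/actor \
  --target_dir /fsx/gptoss-grpo-hf

# Then run the batch evaluation against the merged checkpoint
# (argument names are illustrative, not from the PR).
python recipe/evaluate_gptoss.py --model /fsx/gptoss-grpo-hf --tp 4
```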

Results: GRPO at step 80 achieved 98% reasoning / 80% answer language accuracy
vs the SFT baseline's 96% reasoning / 74% answer (+6 points on answers).
Collaborator

@KeitaW KeitaW left a comment

Review — veRL GRPO Recipe for gpt-oss-20b on g5.12xlarge

Clean, well-scoped PR. Three minor findings below (1 inline on the Dockerfile tag, 1 on pre-existing EFA version, 1 trailing newline). The code quality and documentation are excellent.


Things That Look Great

  • The run_gptoss_grpo.sh header is outstanding. The "WHY THESE SETTINGS?" section with 10 numbered lessons learned from real OOM debugging is exactly the kind of documentation that prevents users from repeating the same mistakes. Each point explains not just what to set but why, with specific memory numbers.
  • The troubleshooting section in the README covers 8 real failure modes (OOM during vLLM init, backward pass, NCCL hang, disk full, zombie jobs, fp32 default, expandable_segments, EFA config) — each with symptom, root cause, and fix.
  • Shell scripts consistently use set -xeuo pipefail — the strictest safety settings.
  • The env_vars.example dual-config pattern with commented p5en and active g5 blocks, each with inline explanations, makes it easy for users to switch between instance types.
  • The reward function (language_reward.py) follows the veRL compute_score API cleanly, with a well-documented scoring formula and clear separation of concerns.
  • The raycluster.yaml parameterization (replacing hardcoded resource values with ${WORKER_CPU}, ${WORKER_MEMORY}, ${FI_EFA_USE_DEVICE_RDMA}) makes the manifest work across instance types without forking.
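For readers unfamiliar with that API: veRL custom rewards are plain functions with the signature compute_score(data_source, solution_str, ground_truth, extra_info=None). A self-contained sketch in the spirit of language_reward.py (the scoring weights, tag format, and toy script-range detector are stand-ins — the real recipe uses langdetect and scores out of 6):

```python
def detect_lang(text: str) -> str:
    """Toy stand-in for langdetect: classify by Unicode script range."""
    if any('\u3040' <= ch <= '\u30ff' for ch in text):   # kana
        return "ja"
    if any('\u0400' <= ch <= '\u04ff' for ch in text):   # Cyrillic
        return "ru"
    return "en"

def compute_score(data_source, solution_str, ground_truth, extra_info=None):
    """Reward = reasoning-language match + answer-language match.

    Assumes ground_truth carries the expected language code and the
    completion looks like '<reasoning>...</reasoning>' followed by the
    answer (a hypothetical format, not necessarily the recipe's).
    """
    expected = ground_truth                     # e.g. "ja"
    head, _, answer = solution_str.partition("</reasoning>")
    reasoning = head.replace("<reasoning>", "")
    score = 0.0
    if detect_lang(reasoning) == expected:
        score += 1.0                            # hypothetical weight
    if detect_lang(answer) == expected:
        score += 1.0
    return score

print(compute_score("demo", "<reasoning>これは</reasoning>答え", "ja"))   # 2.0
print(compute_score("demo", "<reasoning>thinking</reasoning>答え", "ja")) # 1.0
```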

Comment on lines +3 to +4
ARG TAG=vllm011.latest
FROM verlai/verl:${TAG}

Base image tag vllm011.latest resembles a rolling tag

The .latest suffix is ambiguous — it could be a fixed tag that was named with "latest" in it, or a rolling tag that the verlai maintainers update in place. If it's the latter, builds become non-reproducible.

Could you confirm this is a fixed tag? If so, a comment noting that would help. Otherwise, I'd suggest switching to a more explicitly versioned tag or a digest-based pin.

Reference: CONTRIBUTING.md — "External dependencies must be pinned to a specific version or tag (no latest)."

FROM verlai/verl:${TAG}
ARG MEGATRON_LM_VERSION=core_v0.13.1
# EFA configuration
ARG OPEN_MPI_PATH=/opt/amazon/openmpi/

Pre-existing EFA version 1.43.3 is below CI-enforced minimum

This line isn't changed by the PR, but since the Dockerfile is being modified, CI's version check may scan the file and flag EFA_VERSION=1.43.3 (minimum is 1.47.0). I'd suggest bumping it while the file is being touched to avoid a potential CI failure.

Suggested change
ARG OPEN_MPI_PATH=/opt/amazon/openmpi/
ENV EFA_VERSION=1.47.0

```
export LD_LIBRARY_PATH=/opt/amazon/ofi-nccl/lib/x86_64-linux-gnu:$LD_LIBRARY_PATH
```

Without `NCCL_NET=ofi` and the correct `LD_LIBRARY_PATH`, NCCL silently falls back to TCP, giving much worse inter-node bandwidth. (No newline at end of file)

Missing trailing newline

The .editorconfig specifies insert_final_newline = true. Just needs a newline at the end of the file.

Suggested change (the text is identical; the suggestion only adds the final newline)
Without `NCCL_NET=ofi` and the correct `LD_LIBRARY_PATH`, NCCL silently falls back to TCP, giving much worse inter-node bandwidth.
Without `NCCL_NET=ofi` and the correct `LD_LIBRARY_PATH`, NCCL silently falls back to TCP, giving much worse inter-node bandwidth.


@KeitaW KeitaW left a comment


A few nits:

- Bump EFA_VERSION from 1.43.3 to 1.47.0 (CI minimum)
- Add comment clarifying vllm011.latest is a fixed tag, not rolling
- Add missing trailing newline to README.md