Add veRL GRPO training recipe for gpt-oss-20b on g5.12xlarge #1054

Open
nkumaraws wants to merge 3 commits into main from feature/verl-grpo-gptoss-g5-clean

Conversation

@nkumaraws
Contributor

Summary

  • Add complete veRL GRPO (Group Relative Policy Optimization) recipe for training openai/gpt-oss-20b (20B MoE) on g5.12xlarge (4x A10G 24GB GPUs) with HyperPod EKS
  • Custom multilingual language compliance reward function — model must reason and answer in the same language as the user's question
  • Full evaluation pipeline with vLLM batch inference (50-question test set, 5 languages)

Architecture

  • 6 nodes: 3 GPU workers (12 GPUs total) + 1 Ray head (CPU only) + 2 spare
  • veRL + FSDP2: Full CPU offload (offload_policy=True) for both params and optimizer on 24GB GPUs
  • vLLM inline inference: gpu_memory_utilization=0.45, enforce_eager=True (no CUDA graphs)
  • Ray orchestration via KubeRay on HyperPod EKS
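The bullets above map onto a handful of veRL Hydra overrides. A sketch of the launch shape (the authoritative flags live in recipe/run_gptoss_grpo.sh; the override paths follow veRL's standard config layout and may differ by version):

```bash
# Sketch only — real flags live in recipe/run_gptoss_grpo.sh.
ray job submit --address http://localhost:8265 -- \
  python -m verl.trainer.main_ppo \
    algorithm.adv_estimator=grpo \
    actor_rollout_ref.model.path=openai/gpt-oss-20b \
    actor_rollout_ref.actor.strategy=fsdp2 \
    actor_rollout_ref.actor.fsdp_config.offload_policy=True \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.45 \
    actor_rollout_ref.rollout.enforce_eager=True \
    trainer.nnodes=3 \
    trainer.n_gpus_per_node=4 \
    trainer.save_freq=20
```

Note nnodes=3, not 4: only the GPU workers count, since the Ray head advertises zero GPUs.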

Training Results

Validated on HyperPod EKS cluster (trl-gptoss-eks), 80 steps:

| Metric | Value |
|---|---|
| Steps completed | 80 |
| Final val reward | 5.65 / 6.0 |
| Step time | ~2.5 min |
| Total training time | ~3.5 hours |
| Checkpoint size | ~117 GB (FSDP shards) |

Evaluation (GRPO vs SFT baseline):

  • GRPO: 98% reasoning correct, 80% answer correct
  • SFT baseline: 96% reasoning correct, 74% answer correct
  • GRPO improved answer language compliance by 6 percentage points (74% → 80%)

Files

| File | Description |
|---|---|
| Dockerfile | Updated with gpt-oss-20b dependencies |
| README.md | Expanded with GRPO recipe docs, memory optimization, troubleshooting |
| recipe/run_gptoss_grpo.sh | Training launcher (ray job submit) |
| recipe/language_reward.py | Custom reward function (language detection + scoring) |
| recipe/evaluate_gptoss.py | 50-question vLLM evaluation across 5 languages |
| recipe/evaluate_gptoss.sh | Evaluation wrapper script |
| setup/env_vars.example | g5.12xlarge + p5en.48xlarge config templates |
| setup/load_data_gptoss.sh | HuggingFaceH4/Multilingual-Thinking data prep |
| setup/raycluster.yaml | Updated KubeRay manifest (num-gpus=0 head, OOM killer disabled) |

Testing

  • Trained 80 steps on 3x ml.g5.12xlarge workers (12 GPUs)
  • HF checkpoint saved and evaluated successfully
  • No OOM during training or checkpoint save
  • EFA networking validated (NCCL over OFI)

Related

  • Companion PR: OpenRLHF GRPO recipe (same task, different framework) — for framework comparison

Add a complete GRPO training recipe for the openai/gpt-oss-20b MoE model
(20B params, 32 experts) on g5.12xlarge instances with 4x A10G 24GB GPUs.
This required FSDP2 with CPU offloading — FSDP1 explicitly disables
CPUOffload for the actor role, making it impossible to train models
larger than ~10B on 24GB GPUs.
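Back-of-envelope arithmetic (my own numbers, not from the PR) shows why CPU offload is unavoidable at this scale: even fully sharded across all 12 GPUs, bf16 weights plus gradients plus fp32 Adam state slightly exceed a 24 GiB A10G before any activations or vLLM KV cache are accounted for.

```python
# Rough memory budget for a 20B-parameter model under FSDP sharding
# across 12 GPUs (approximate; assumes bf16 weights/grads and fp32
# AdamW master weights + two moments).
GB = 1024**3
n_params = 20e9

weights_bf16 = n_params * 2      # sharded model weights
grads_bf16 = n_params * 2        # gradients
adam_fp32 = n_params * 4 * 3     # master weights + exp_avg + exp_avg_sq

total = weights_bf16 + grads_bf16 + adam_fp32
per_gpu = total / 12 / GB
print(f"per-GPU training state if kept on GPU: {per_gpu:.1f} GiB")
# ~24.8 GiB — already over budget on a 24 GiB A10G, with zero room
# for activations or the co-located vLLM engine.
```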

New files:
- recipe/run_gptoss_grpo.sh: GRPO training script with FSDP2,
  offload_policy, bf16, gpu_memory_utilization=0.6, enforce_eager,
  and detailed comments explaining why each parameter is required
- recipe/language_reward.py: Custom veRL reward function for
  multilingual language compliance scoring
- setup/load_data_gptoss.sh: Data preparation for
  HuggingFaceH4/Multilingual-Thinking dataset in veRL parquet format

Modified files:
- Dockerfile: Updated base image to verlai/verl:vllm011.latest,
  added langdetect and peft dependencies, added Megatron-LM
- README.md: Added g5 instance guidance, checkpoint management
  section, and troubleshooting for 7 common failure modes
- env_vars.example: Added g5.12xlarge reference config with
  critical notes (WORKER_MEMORY=150Gi, FI_EFA_USE_DEVICE_RDMA=0)
- raycluster.yaml: Parameterized CPU/memory/EFA settings via envsubst

Key lessons encoded in the recipe (from 11 OOM iterations):
- FSDP2 required (FSDP1 keeps actor on GPU)
- offload_policy=True (FSDP2-specific CPU offload)
- model_dtype=bf16 (veRL defaults actor to fp32)
- gpu_memory_utilization is fraction of TOTAL GPU, not just KV cache
- enforce_eager=True (CUDA graphs OOM on 24GB)
- save_freq=20 (117GB/checkpoint fills 1.2TB FSx in 9 steps at freq=1)
- nnodes=3 not 4 (head pod has no GPUs, causes NCCL hang)
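The save_freq lesson checks out numerically; a quick sketch (assuming GiB-sized checkpoints on a decimal 1.2 TB volume):

```python
# Sanity-check the checkpoint budget: how many 117 GiB checkpoints
# fit on a 1.2 TB (decimal) FSx volume?
ckpt_gib = 117
fsx_gib = 1.2e12 / 1024**3          # ~1117.6 GiB usable
fits = int(fsx_gib // ckpt_gib)
print(fits)                          # 9 — at save_freq=1 the volume
                                     # fills after 9 steps
```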

Adds a post-training evaluation workflow for veRL GRPO checkpoints:
- evaluate_gptoss.py: 50-question eval (10 prompts x 5 languages) using
  vLLM TP=4 batch inference with langdetect scoring
- evaluate_gptoss.sh: end-to-end wrapper that converts FSDP shards to HF
  format via verl.model_merger, then runs eval with optional SFT baseline
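The convert-then-evaluate flow might look like the following sketch (verl.model_merger exists in recent veRL releases but its flags vary by version; the evaluate_gptoss.py arguments shown here are hypothetical):

```bash
# Merge FSDP shards into a HuggingFace-format checkpoint
# (flag names vary across veRL versions — check your release).
python -m verl.model_merger merge \
  --backend fsdp \
  --local_dir checkpoints/global_step_80/actor \
  --target_dir /fsx/gptoss-grpo-hf

# Then run the batch evaluation against the merged checkpoint
# (argument names are illustrative, not from the PR).
python recipe/evaluate_gptoss.py --model /fsx/gptoss-grpo-hf --tp 4
```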

Results: GRPO at step 80 achieved 98% reasoning / 80% answer language accuracy
vs the SFT baseline's 96% reasoning / 74% answer (+6 points on answers).
Collaborator

@KeitaW KeitaW left a comment

Review — veRL GRPO Recipe for gpt-oss-20b on g5.12xlarge

Clean, well-scoped PR. Three minor findings below (1 inline on the Dockerfile tag, 1 on pre-existing EFA version, 1 trailing newline). The code quality and documentation are excellent.


Things That Look Great

  • The run_gptoss_grpo.sh header is outstanding. The "WHY THESE SETTINGS?" section with 10 numbered lessons learned from real OOM debugging is exactly the kind of documentation that prevents users from repeating the same mistakes. Each point explains not just what to set but why, with specific memory numbers.
  • The troubleshooting section in the README covers 8 real failure modes (OOM during vLLM init, backward pass, NCCL hang, disk full, zombie jobs, fp32 default, expandable_segments, EFA config) — each with symptom, root cause, and fix.
  • Shell scripts consistently use set -xeuo pipefail — the strictest safety settings.
  • The env_vars.example dual-config pattern with commented p5en and active g5 blocks, each with inline explanations, makes it easy for users to switch between instance types.
  • The reward function (language_reward.py) follows the veRL compute_score API cleanly, with a well-documented scoring formula and clear separation of concerns.
  • The raycluster.yaml parameterization (replacing hardcoded resource values with ${WORKER_CPU}, ${WORKER_MEMORY}, ${FI_EFA_USE_DEVICE_RDMA}) makes the manifest work across instance types without forking.
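For readers unfamiliar with that API: veRL custom rewards are plain functions with the signature compute_score(data_source, solution_str, ground_truth, extra_info=None). A self-contained sketch in the spirit of language_reward.py (the scoring weights, tag format, and toy script-range detector are stand-ins — the real recipe uses langdetect and scores out of 6):

```python
def detect_lang(text: str) -> str:
    """Toy stand-in for langdetect: classify by Unicode script range."""
    if any('\u3040' <= ch <= '\u30ff' for ch in text):   # kana
        return "ja"
    if any('\u0400' <= ch <= '\u04ff' for ch in text):   # Cyrillic
        return "ru"
    return "en"

def compute_score(data_source, solution_str, ground_truth, extra_info=None):
    """Reward = reasoning-language match + answer-language match.

    Assumes ground_truth carries the expected language code and the
    completion looks like '<reasoning>...</reasoning>' followed by the
    answer (a hypothetical format, not necessarily the recipe's).
    """
    expected = ground_truth                     # e.g. "ja"
    head, _, answer = solution_str.partition("</reasoning>")
    reasoning = head.replace("<reasoning>", "")
    score = 0.0
    if detect_lang(reasoning) == expected:
        score += 1.0                            # hypothetical weight
    if detect_lang(answer) == expected:
        score += 1.0
    return score

print(compute_score("demo", "<reasoning>これは</reasoning>答え", "ja"))   # 2.0
print(compute_score("demo", "<reasoning>thinking</reasoning>答え", "ja")) # 1.0
```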

Comment on lines +3 to +4
ARG TAG=vllm011.latest
FROM verlai/verl:${TAG}

Base image tag vllm011.latest resembles a rolling tag

The .latest suffix is ambiguous — it could be a fixed tag that was named with "latest" in it, or a rolling tag that the verlai maintainers update in place. If it's the latter, builds become non-reproducible.

Could you confirm this is a fixed tag? If so, a comment noting that would help. Otherwise, I'd suggest switching to a more explicitly versioned tag or a digest-based pin.

Reference: CONTRIBUTING.md — "External dependencies must be pinned to a specific version or tag (no latest)."

FROM verlai/verl:${TAG}
ARG MEGATRON_LM_VERSION=core_v0.13.1
# EFA configuration
ARG OPEN_MPI_PATH=/opt/amazon/openmpi/

Pre-existing EFA version 1.43.3 is below CI-enforced minimum

This line isn't changed by the PR, but since the Dockerfile is being modified, CI's version check may scan the file and flag EFA_VERSION=1.43.3 (minimum is 1.47.0). I'd suggest bumping it while the file is being touched to avoid a potential CI failure.

Suggested change
ARG OPEN_MPI_PATH=/opt/amazon/openmpi/
ENV EFA_VERSION=1.47.0

```
export LD_LIBRARY_PATH=/opt/amazon/ofi-nccl/lib/x86_64-linux-gnu:$LD_LIBRARY_PATH
```

Without `NCCL_NET=ofi` and the correct `LD_LIBRARY_PATH`, NCCL silently falls back to TCP, giving much worse inter-node bandwidth. (No newline at end of file)

Missing trailing newline

The .editorconfig specifies insert_final_newline = true. Just needs a newline at the end of the file.

Suggested change (the text is identical; the suggestion only adds the final newline)
Without `NCCL_NET=ofi` and the correct `LD_LIBRARY_PATH`, NCCL silently falls back to TCP, giving much worse inter-node bandwidth.
Without `NCCL_NET=ofi` and the correct `LD_LIBRARY_PATH`, NCCL silently falls back to TCP, giving much worse inter-node bandwidth.


@KeitaW KeitaW left a comment


A few nits:

- Bump EFA_VERSION from 1.43.3 to 1.47.0 (CI minimum)
- Add comment clarifying vllm011.latest is a fixed tag, not rolling
- Add missing trailing newline to README.md