Add veRL GRPO training recipe for gpt-oss-20b on g5.12xlarge #1054
Add a complete GRPO training recipe for the openai/gpt-oss-20b MoE model (20B params, 32 experts) on g5.12xlarge instances with 4x A10G 24GB GPUs. This required FSDP2 with CPU offloading: FSDP1 explicitly disables CPUOffload for the actor role, making it impossible to train models larger than ~10B on 24GB GPUs.

New files:
- `recipe/run_gptoss_grpo.sh`: GRPO training script with FSDP2, `offload_policy`, bf16, `gpu_memory_utilization=0.6`, `enforce_eager`, and detailed comments explaining why each parameter is required
- `recipe/language_reward.py`: custom veRL reward function for multilingual language compliance scoring
- `setup/load_data_gptoss.sh`: data preparation for the HuggingFaceH4/Multilingual-Thinking dataset in veRL parquet format

Modified files:
- `Dockerfile`: updated base image to `verlai/verl:vllm011.latest`, added langdetect and peft dependencies, added Megatron-LM
- `README.md`: added g5 instance guidance, a checkpoint management section, and troubleshooting for 7 common failure modes
- `env_vars.example`: added a g5.12xlarge reference config with critical notes (`WORKER_MEMORY=150Gi`, `FI_EFA_USE_DEVICE_RDMA=0`)
- `raycluster.yaml`: parameterized CPU/memory/EFA settings via envsubst

Key lessons encoded in the recipe (from 11 OOM iterations):
- FSDP2 required (FSDP1 keeps the actor on GPU)
- `offload_policy=True` (FSDP2-specific CPU offload)
- `model_dtype=bf16` (veRL defaults the actor to fp32)
- `gpu_memory_utilization` is a fraction of TOTAL GPU memory, not just KV cache
- `enforce_eager=True` (CUDA graphs OOM on 24GB)
- `save_freq=20` (at 117GB/checkpoint, freq=1 fills a 1.2TB FSx volume in 9 steps)
- `nnodes=3`, not 4 (the head pod has no GPUs, which causes an NCCL hang)
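Taken together, the lessons above amount to a launcher roughly like the following. This is a sketch, not the recipe's actual `run_gptoss_grpo.sh`: the hydra key names follow veRL's FSDP config conventions and should be verified against the installed veRL release.

```shell
#!/usr/bin/env bash
set -xeuo pipefail

# Sketch of a veRL GRPO launch encoding the lessons above (key names assumed
# from veRL's hydra config; check them against your veRL version):
# - strategy=fsdp2, because FSDP1 pins the actor on GPU
# - offload_policy=True is the FSDP2-specific CPU offload
# - model_dtype=bf16, because veRL defaults the actor to fp32
# - gpu_memory_utilization is a fraction of TOTAL GPU memory
# - enforce_eager avoids CUDA-graph OOM on 24GB A10Gs
# - nnodes=3, because the head pod has no GPUs
python3 -m verl.trainer.main_ppo \
  algorithm.adv_estimator=grpo \
  actor_rollout_ref.model.path=openai/gpt-oss-20b \
  actor_rollout_ref.actor.strategy=fsdp2 \
  actor_rollout_ref.actor.fsdp_config.offload_policy=True \
  actor_rollout_ref.actor.fsdp_config.model_dtype=bf16 \
  actor_rollout_ref.rollout.gpu_memory_utilization=0.6 \
  actor_rollout_ref.rollout.enforce_eager=True \
  trainer.nnodes=3 \
  trainer.n_gpus_per_node=4 \
  trainer.save_freq=20
```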
Adds a post-training evaluation workflow for veRL GRPO checkpoints:
- `evaluate_gptoss.py`: 50-question eval (10 prompts x 5 languages) using vLLM TP=4 batch inference with langdetect scoring
- `evaluate_gptoss.sh`: end-to-end wrapper that converts FSDP shards to HF format via `verl.model_merger`, then runs the eval with an optional SFT baseline

Results: GRPO step 80 achieved 98% reasoning / 80% answer language accuracy vs. the SFT baseline's 96% reasoning / 74% answer (+6 points on answer accuracy).
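The recipe's actual scoring lives in `language_reward.py` and `evaluate_gptoss.py` and is not reproduced here. As an illustration of the metric only, a minimal accuracy computation over already-detected languages might look like this (the helper name and inputs are hypothetical; language detection itself, done with langdetect in the recipe, is assumed to have produced one language code per response):

```python
def language_accuracy(detections, targets):
    """Fraction of responses whose detected language matches the target.

    detections/targets are parallel lists of ISO 639-1 codes (e.g. "en", "fr").
    Hypothetical helper, illustrating the metric, not the recipe's code.
    """
    if not detections:
        return 0.0
    hits = sum(1 for got, want in zip(detections, targets) if got == want)
    return hits / len(detections)


# Mirrors the reported shape of the eval: separate scores for the reasoning
# trace and the final answer, each detected independently.
reasoning_detected = ["fr", "fr", "de", "es", "fr"]
answer_detected    = ["fr", "en", "de", "es", "fr"]
targets            = ["fr", "fr", "de", "es", "fr"]

print(language_accuracy(reasoning_detected, targets))  # 1.0
print(language_accuracy(answer_detected, targets))     # 0.8
```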
KeitaW left a comment:
Review — veRL GRPO Recipe for gpt-oss-20b on g5.12xlarge
Clean, well-scoped PR. Three minor findings below (1 inline on the Dockerfile tag, 1 on pre-existing EFA version, 1 trailing newline). The code quality and documentation are excellent.
Things That Look Great
- The `run_gptoss_grpo.sh` header is outstanding. The "WHY THESE SETTINGS?" section with 10 numbered lessons learned from real OOM debugging is exactly the kind of documentation that prevents users from repeating the same mistakes. Each point explains not just what to set but why, with specific memory numbers.
- The troubleshooting section in the README covers 8 real failure modes (OOM during vLLM init, backward pass, NCCL hang, disk full, zombie jobs, fp32 default, expandable_segments, EFA config), each with symptom, root cause, and fix.
- Shell scripts consistently use `set -xeuo pipefail`, the strictest safety settings.
- The `env_vars.example` dual-config pattern, with commented p5en and active g5 blocks and inline explanations for each, makes it easy for users to switch between instance types.
- The reward function (`language_reward.py`) follows the veRL `compute_score` API cleanly, with a well-documented scoring formula and clear separation of concerns.
- The raycluster.yaml parameterization (replacing hardcoded resource values with `${WORKER_CPU}`, `${WORKER_MEMORY}`, `${FI_EFA_USE_DEVICE_RDMA}`) makes the manifest work across instance types without forking.
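For readers unfamiliar with the flags, a small standalone demonstration of what the strict mode catches (not from the recipe; `-x` tracing is omitted here to keep the output readable):

```shell
#!/usr/bin/env bash
set -euo pipefail

# pipefail: a pipeline fails if ANY stage fails, not just the last one.
if (false | true); then
  echo "pipeline status: 0"
else
  echo "pipeline status: nonzero"  # taken: pipefail propagates the failing `false`
fi

# -u: expanding an unset variable is a hard error instead of a silent empty string.
(echo "value: ${UNSET_VAR}") 2>/dev/null || echo "unset variable caught"
```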
```dockerfile
ARG TAG=vllm011.latest
FROM verlai/verl:${TAG}
```
Base image tag `vllm011.latest` resembles a rolling tag
The `.latest` suffix is ambiguous: it could be a fixed tag that happens to have "latest" in its name, or a rolling tag that the verlai maintainers update in place. If it's the latter, builds become non-reproducible.
Could you confirm this is a fixed tag? If so, a comment noting that would help. Otherwise, I'd suggest switching to a more explicitly versioned tag or a digest-based pin.
Reference: CONTRIBUTING.md — "External dependencies must be pinned to a specific version or tag (no latest)."
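If the tag does turn out to be rolling, a digest-based pin keeps the build reproducible regardless of tag moves. A sketch (the digest below is a placeholder, not the image's real digest; it can be resolved with `docker buildx imagetools inspect verlai/verl:vllm011.latest`):

```dockerfile
# Pin by digest: immutable even if the tag is re-pointed upstream.
# Placeholder digest -- substitute the value resolved from the tag.
FROM verlai/verl@sha256:<digest-resolved-from-vllm011.latest>
```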
```dockerfile
FROM verlai/verl:${TAG}
ARG MEGATRON_LM_VERSION=core_v0.13.1
# EFA configuration
ARG OPEN_MPI_PATH=/opt/amazon/openmpi/
```
Pre-existing EFA version 1.43.3 is below CI-enforced minimum
This line isn't changed by the PR, but since the Dockerfile is being modified, CI's version check may scan the file and flag `EFA_VERSION=1.43.3` (the minimum is 1.47.0). I'd suggest bumping it while the file is being touched to avoid a potential CI failure.
```suggestion
ARG OPEN_MPI_PATH=/opt/amazon/openmpi/
ENV EFA_VERSION=1.47.0
```
```shell
export LD_LIBRARY_PATH=/opt/amazon/ofi-nccl/lib/x86_64-linux-gnu:$LD_LIBRARY_PATH
```

Without `NCCL_NET=ofi` and the correct `LD_LIBRARY_PATH`, NCCL silently falls back to TCP, giving much worse inter-node bandwidth.

*(No newline at end of file)*
Missing trailing newline
The `.editorconfig` specifies `insert_final_newline = true`. Just needs a newline at the end of the file.
```suggestion
Without `NCCL_NET=ofi` and the correct `LD_LIBRARY_PATH`, NCCL silently falls back to TCP, giving much worse inter-node bandwidth.
```

(The suggestion's only change is the added trailing newline.)
- Bump EFA_VERSION from 1.43.3 to 1.47.0 (CI minimum)
- Add comment clarifying vllm011.latest is a fixed tag, not rolling
- Add missing trailing newline to README.md
Summary
Architecture
- FSDP2 with CPU offloading (`offload_policy=True`) for both params and optimizer on 24GB GPUs
- `gpu_memory_utilization=0.45`, `enforce_eager=True` (no CUDA graphs)

Training Results
Validated on HyperPod EKS cluster (`trl-gptoss-eks`), 80 steps.

Evaluation (GRPO vs SFT baseline):

| Metric | SFT baseline | GRPO (step 80) |
| --- | --- | --- |
| Reasoning language accuracy | 96% | 98% |
| Answer language accuracy | 74% | 80% |
Files
- `Dockerfile`
- `README.md`
- `recipe/run_gptoss_grpo.sh`
- `recipe/language_reward.py`
- `recipe/evaluate_gptoss.py`
- `recipe/evaluate_gptoss.sh`
- `setup/env_vars.example`
- `setup/load_data_gptoss.sh`
- `setup/raycluster.yaml`

Testing
Related