GDPO vs. GRPO on tool calling RL training (GDPO Implementation based on verl)

Installation

Please install torch, vllm and ray according to your own environment configuration.

# install torch
pip install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu121
# install vllm
pip install vllm==0.6.3
pip install ray

Please further install the verl in the current project and flash attention.

# verl
cd verl
pip install -e .

# flash attention 2
pip install flash-attn --no-build-isolation

Please first clone the ToolRL repo then copy the data to this folder

git clone git@github.com:qiancheng0/ToolRL.git
cp -r ToolRL/dataset verl-GDPO/

Before starting, configure your API keys and Hugging Face cache path:

export WANDB_API_KEY="Your API KEY"
export HF_TOKEN="Your API KEY"
export HF_HOME="YOU HF CACHE ADDRESS"

For GRPO and GDPO training, please specify the configuration in train_gdpo.sh and train_grpo.sh

bash train_gdpo.sh # For GDPO Training
bash train_grpo.sh # For GRPO Training

FYI. training Qwen2.5-1.5B-Instruct on a single Node with 8xA100 takes about an hour to finish.

Please see line 175 in verl-GDPO/verl/trainer/ppo/ray_trainer.py