When I post-train with the code below, why is the KL always nan? Even if I explicitly pass ref_model to GRPOTrainer, it still doesn't work.
And why is the loss negative?
My training code:
from trl import GRPOConfig, GRPOTrainer

training_args = GRPOConfig(
    learning_rate=1e-5,
    adam_beta1=0.9,
    adam_beta2=0.99,
    weight_decay=0.1,
    warmup_ratio=0.01,
    lr_scheduler_type="cosine",
    optim="paged_adamw_8bit",
    logging_steps=1,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=1,
    num_generations=8,  # decrease if out of memory
    max_prompt_length=max_prompt_length,
    max_completion_length=max_seq_length - max_prompt_length,
    max_grad_norm=0.1,
    # report_to="wandb",
    output_dir="/root/megrez-tmp/grpo_outputs2",
    overwrite_output_dir=True,
    # push_to_hub=False,
    # hub_model_id=new_model_id,
    # hub_strategy="every_save",
    save_strategy="steps",
    save_steps=50,
    save_total_limit=1,
    num_train_epochs=3,
)
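# Note: GRPOConfig also exposes a `beta` parameter (the KL coefficient);
# I do not set it above, so the library default applies.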
trainer = GRPOTrainer(
    # model and ref_model are loaded by Unsloth
    model=model,
    ref_model=ref_model,  # 👈 this is the key part! it must be added!
    processing_class=tokenizer,
    reward_funcs=[
        format_reward,
        sorted_events_reward,
        score_reward,
    ],
    args=training_args,
    train_dataset=ds,
    callbacks=[swanlab_callback],
)
trainer.train()
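
For completeness, here is a minimal sketch of the pieces the snippet above assumes but does not show. Everything here (the model id, the reward stubs, the dataset contents) is a placeholder standing in for my real setup, not the exact code I run:

from unsloth import FastLanguageModel
from datasets import Dataset

max_seq_length = 2048    # placeholder; my real value differs
max_prompt_length = 512  # placeholder

# model/tokenizer loaded with Unsloth, as referenced in the snippet above;
# ref_model is loaded the same way (not shown)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="some/base-model",  # placeholder model id
    max_seq_length=max_seq_length,
    load_in_4bit=True,
)

# stub reward functions with the signature GRPOTrainer expects:
# (prompts, completions, **kwargs) -> list[float]
def format_reward(prompts, completions, **kwargs):
    return [0.0 for _ in completions]

def sorted_events_reward(prompts, completions, **kwargs):
    return [0.0 for _ in completions]

def score_reward(prompts, completions, **kwargs):
    return [0.0 for _ in completions]

# tiny placeholder dataset with the "prompt" column GRPOTrainer reads
ds = Dataset.from_dict({"prompt": ["example prompt 1", "example prompt 2"]})

# swanlab_callback comes from SwanLab's HF Trainer integration (not shown)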