Why is kl = nan during GRPO training? #704

@uilstong

Description

When I post-train with the code below, why is the kl always nan? Even when I explicitly pass ref_model to GRPOTrainer, it still doesn't work.
And why is the loss negative?
The logs look like this:

| Step | Training Loss | reward | reward_std | kl | entropy |
|------|---------------|--------|------------|-----|---------|
| 1 | -0.283600 | 13.437500 | 21.810942 | nan | 0 |
| 2 | -0.149100 | 6.931818 | 14.526671 | nan | No Log |
| 3 | -0.110500 | 6.250000 | 17.677670 | nan | No Log |
| 4 | -0.014500 | 6.422414 | 12.048451 | nan | No Log |

| Step | Training Loss | reward | reward_std | completions/mean_length | completions/min_length | completions/max_length | completions/clipped_ratio | completions/mean_terminated_length | completions/min_terminated_length | completions/max_terminated_length | kl | entropy | rewards/format_reward/mean | rewards/format_reward/std | rewards/sorted_events_reward/mean | rewards/sorted_events_reward/std | rewards/score_reward/mean | rewards/score_reward/std |
|------|---------------|--------|------------|--------------------------|------------------------|------------------------|---------------------------|------------------------------------|-----------------------------------|-----------------------------------|-----|---------|-----------------------------|----------------------------|------------------------------------|-----------------------------------|----------------------------|---------------------------|
| 1 | -0.283600 | 13.437500 | 21.810942 | 1115.437500 | 253.000000 | 1600.000000 | 0.500000 | 630.875000 | 253.000000 | 1585.000000 | nan | 0 | 3.125000 | 4.787136 | 3.750000 | 8.062258 | 6.562500 | 19.036697 |
| 2 | -0.149100 | 6.931818 | 14.526671 | 1203.250000 | 235.000000 | 1600.000000 | 0.625000 | 542.000000 | 235.000000 | 1594.000000 | nan | No Log | 1.250000 | 3.415650 | 2.500000 | 6.831301 | 3.181818 | 12.727274 |
| 3 | -0.110500 | 6.250000 | 17.677670 | 1316.687500 | 255.000000 | 1600.000000 | 0.750000 | 466.750000 | 255.000000 | 1078.000000 | nan | No Log | 0.625000 | 2.500000 | 1.250000 | 5.000000 | 4.375000 | 17.500000 |
| 4 | -0.014500 | 6.422414 | 12.048451 | 1173.500000 | 131.000000 | 1600.000000 | 0.687500 | 235.199997 | 131.000000 | 468.000000 | nan | No Log | 0.000000 | 0.000000 | 2.500000 | 6.831301 | 3.922414 | 11.039421 |


```python
from trl import GRPOConfig, GRPOTrainer

training_args = GRPOConfig(
    learning_rate=1e-5,
    adam_beta1=0.9,
    adam_beta2=0.99,
    weight_decay=0.1,
    warmup_ratio=0.01,
    lr_scheduler_type="cosine",
    optim="paged_adamw_8bit",
    logging_steps=1,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=1,
    num_generations=8,  # Decrease if out of memory
    max_prompt_length=max_prompt_length,
    max_completion_length=max_seq_length - max_prompt_length,
    max_grad_norm=0.1,
    # report_to="wandb",
    output_dir="/root/megrez-tmp/grpo_outputs2",
    overwrite_output_dir=True,
    # push_to_hub=False,
    # hub_model_id=new_model_id,
    # hub_strategy="every_save",
    save_strategy="steps",
    save_steps=50,
    save_total_limit=1,
    num_train_epochs=3,
)
trainer = GRPOTrainer(
    # model and ref_model are loaded via Unsloth
    model=model,
    ref_model=ref_model,  # 👈 key: must be added!
    processing_class=tokenizer,
    reward_funcs=[
        format_reward,
        sorted_events_reward,
        score_reward,
    ],
    args=training_args,
    train_dataset=ds,
    callbacks=[swanlab_callback],
)
trainer.train()
```
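
For reference, a minimal sketch of the usual cause, assuming a recent TRL release: GRPOConfig exposes a `beta` KL coefficient, and when it is 0.0 (the default in newer versions) the trainer keeps no reference model, so no KL is computed for the `kl` log column. In the TRL versions I'm aware of, GRPOTrainer also does not accept a `ref_model` argument; the reference policy is derived internally from `model` whenever `beta > 0`. Verify both points against your installed version:

```python
# Minimal sketch, assuming a recent TRL release where GRPOConfig has a
# `beta` KL coefficient and GRPOTrainer builds its reference model
# internally when beta > 0. `model`, `tokenizer`, the reward functions,
# and `ds` are the same objects as in the report above.
from trl import GRPOConfig, GRPOTrainer

training_args = GRPOConfig(
    beta=0.04,  # non-zero KL coefficient; with beta=0.0 the KL term is
                # skipped entirely and the logged `kl` column stays nan
    logging_steps=1,
    output_dir="/root/megrez-tmp/grpo_outputs2",
)
trainer = GRPOTrainer(
    model=model,                 # no explicit ref_model: the trainer derives
    processing_class=tokenizer,  # the reference policy from `model` itself
    reward_funcs=[format_reward, sorted_events_reward, score_reward],
    args=training_args,
    train_dataset=ds,
)
trainer.train()
```

As for the negative loss: the GRPO objective is a policy-gradient surrogate of the form -E[ratio × advantage] + β·KL, so whenever the advantage-weighted terms are positive the logged loss goes negative. A negative training loss is expected here and is not an error by itself.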
