
Efficient Agentic LLM

Value Function Estimation

  1. Zeroth-Order Policy Gradient for Reinforcement Learning from Human Feedback without Reward Inference ICLR 2025. Paper

    Qining Zhang, Lei Ying

    Motivation: reward-function construction is a bottleneck in the RLHF pipeline (RLHF -> DPO -> GRPO).

    Design: directly apply the policy gradient via a zeroth-order (ZO) estimate of the value function, without inferring a reward model.
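
    The core zeroth-order idea can be sketched as follows: estimate a gradient purely from value (function) evaluations, with no analytic gradient. This is a minimal illustrative sketch of a standard two-point ZO estimator, not the paper's exact algorithm; the function names and parameters here are assumptions for illustration.

    ```python
    import numpy as np

    def zo_gradient(f, theta, mu=1e-2, num_samples=2048, seed=0):
        """Two-point zeroth-order gradient estimate of f at theta.

        Uses only scalar evaluations of f (as a ZO method would use value
        queries), perturbing theta along random Gaussian directions u and
        averaging the finite-difference estimate ((f(theta+mu*u) -
        f(theta-mu*u)) / (2*mu)) * u.
        """
        rng = np.random.default_rng(seed)
        grad = np.zeros_like(theta)
        for _ in range(num_samples):
            u = rng.standard_normal(theta.shape)
            grad += (f(theta + mu * u) - f(theta - mu * u)) / (2 * mu) * u
        return grad / num_samples

    # Toy check: f(x) = -||x||^2 has true gradient -2x.
    theta = np.array([1.0, -2.0])
    g = zo_gradient(lambda x: -np.dot(x, x), theta)
    ```

    In an RLHF setting, `f` would be replaced by a (noisy) value estimate of the policy, so the gradient step needs no reward-model inference.
    
    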