Zeroth-Order Policy Gradient for Reinforcement Learning from Human Feedback without Reward Inference (ICLR 2025)
Qining Zhang, Lei Ying
Motivation: reward model construction is a bottleneck in RLHF; the field has progressed from classic RLHF (explicit reward inference) to DPO and GRPO, which reduce or sidestep that step.
Design: apply policy gradient directly, using a zeroth-order (ZO) estimate of the value function from comparison feedback, so no reward model is ever inferred.
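The design point can be illustrated with a generic two-point zeroth-order gradient estimator. This is a minimal sketch, not the paper's exact algorithm (which uses human comparison feedback as the value oracle); the function name `zo_policy_gradient` and parameters `mu`, `num_dirs` are chosen here for illustration:

```python
import numpy as np

def zo_policy_gradient(value_fn, theta, mu=1e-2, num_dirs=32, rng=None):
    """Two-point zeroth-order estimate of the gradient of value_fn at theta.

    g ~ (1/k) * sum_i [(J(theta + mu*u_i) - J(theta - mu*u_i)) / (2*mu)] * u_i
    with random Gaussian directions u_i. Only function evaluations are
    needed -- no analytic gradient of the policy's value function.
    """
    rng = rng or np.random.default_rng(0)
    grad = np.zeros_like(theta)
    for _ in range(num_dirs):
        u = rng.standard_normal(theta.shape)
        diff = value_fn(theta + mu * u) - value_fn(theta - mu * u)
        grad += (diff / (2 * mu)) * u
    return grad / num_dirs

# Toy example: ZO gradient ascent on J(theta) = -||theta - target||^2.
target = np.array([1.0, -2.0, 0.5])
J = lambda th: -np.sum((th - target) ** 2)

theta = np.zeros(3)
for _ in range(300):
    theta += 0.05 * zo_policy_gradient(J, theta)
print(np.round(theta, 2))  # theta approaches target
```

In the paper's setting, the role of `value_fn` would be played by a value estimate derived from human comparisons rather than a numeric oracle; the ZO machinery itself is unchanged.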