Zeroth-Order Policy Gradient for Reinforcement Learning from Human Feedback without Reward Inference (ICLR 2025)
Qining Zhang, Lei Ying
Motivation: reward model construction is a bottleneck in RLHF; the field has progressed from classic RLHF (explicit reward inference) to DPO and GRPO, which reduce or sidestep that step.
Design: apply policy gradient directly, using a zeroth-order (ZO) estimate of the value function from comparison feedback, so no reward model is ever inferred.
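The design point can be illustrated with a generic two-point zeroth-order gradient estimator. This is a minimal sketch, not the paper's exact algorithm (which uses human comparison feedback as the value oracle); the function name `zo_policy_gradient` and parameters `mu`, `num_dirs` are chosen here for illustration:

```python
import numpy as np

def zo_policy_gradient(value_fn, theta, mu=1e-2, num_dirs=32, rng=None):
    """Two-point zeroth-order estimate of the gradient of value_fn at theta.

    g ~ (1/k) * sum_i [(J(theta + mu*u_i) - J(theta - mu*u_i)) / (2*mu)] * u_i
    with random Gaussian directions u_i. Only function evaluations are
    needed -- no analytic gradient of the policy's value function.
    """
    rng = rng or np.random.default_rng(0)
    grad = np.zeros_like(theta)
    for _ in range(num_dirs):
        u = rng.standard_normal(theta.shape)
        diff = value_fn(theta + mu * u) - value_fn(theta - mu * u)
        grad += (diff / (2 * mu)) * u
    return grad / num_dirs

# Toy example: ZO gradient ascent on J(theta) = -||theta - target||^2.
target = np.array([1.0, -2.0, 0.5])
J = lambda th: -np.sum((th - target) ** 2)

theta = np.zeros(3)
for _ in range(300):
    theta += 0.05 * zo_policy_gradient(J, theta)
print(np.round(theta, 2))  # theta approaches target
```

In the paper's setting, the role of `value_fn` would be played by a value estimate derived from human comparisons rather than a numeric oracle; the ZO machinery itself is unchanged.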