Reinforcement Learning from Human Feedback: A Systematic Review
Journal of Artificial Intelligence Research • Vol. 12, No. 4
Abstract
This systematic review examines 147 studies on reinforcement learning from human feedback (RLHF) published between 2017 and 2024. We identify key methodological trends and open challenges in reward modeling, and propose a unified taxonomy for RLHF evaluation protocols. The review highlights convergence on Constitutional AI and debate-based approaches for scalable oversight.