[TOC]
- Title: Reward Model Ensembles Help Mitigate Overoptimization
- Author: Thomas Coste et. al.
- Publish Year: 10 Mar 2024
- Review Date: Thu, May 9, 2024
- url: arXiv:2310.02743v2
Summary of paper
Motivation
- however, as imperfect representation of the “true” reward, these learned reward models are susceptible to over-optimization.
Contribution
- the author conducted a systematic study to evaluate the efficacy of using ensemble-based conservative optimization objectives, specially worst-case optimization (WCO) and uncertainty-weighted optimization (UWO), for mitigating reward model overoptimization
- the author additionally extend the setup to include 25% label noise to better mirror real-world conditions
- For PPO, ensemble-based conservative optimization always reduce overoptimization and outperforms single reward model optimization
Some key terms
Overoptimization
- a phenomenon in which policy optimization appears to be making progress according to the learned reward model, but in reality begins to regress with respect to the true reward function
Label noises
- In the real-world RLHF setup, in which agreement rates among human annotators are typically between 60% - 75% (Ziegler et al., 2019; Stiennon et al., 2020; Dubois et al., 2023).
Best of N Sampling
Method: reward model ensemble
Results
for PPO, with a small KL penalty coefficient of 0.01 ($\beta$), WCO and UWO both successfully prevent overoptimization.
Summary
Use ensemble of reward models