1. Title: Reward Model Ensembles Help Mitigate Overoptimization
  2. Author: Thomas Coste et al.
  3. Publish Date: 10 Mar 2024
  4. Review Date: Thu, May 9, 2024
  5. url: arXiv:2310.02743v2

Summary of paper



  • Learned reward models are imperfect representations of the “true” reward, so they are susceptible to overoptimization.


  • The authors conduct a systematic study of ensemble-based conservative optimization objectives, specifically worst-case optimization (WCO) and uncertainty-weighted optimization (UWO), for mitigating reward model overoptimization.
  • The authors additionally extend the setup to include 25% label noise to better mirror real-world conditions.
  • For PPO, ensemble-based conservative optimization always reduces overoptimization and outperforms single reward model optimization.
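
The two conservative objectives named above can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes WCO takes the minimum reward across ensemble members and UWO takes the ensemble mean minus a penalty proportional to the intra-ensemble (population) variance. The function names, the example scores, and the value of `lam` are all hypothetical.

```python
from statistics import fmean, pvariance


def mean_reward(rewards):
    """Baseline: plain average of the ensemble members' rewards."""
    return fmean(rewards)


def wco_reward(rewards):
    """Worst-case optimization: trust only the most pessimistic member."""
    return min(rewards)


def uwo_reward(rewards, lam=0.5):
    """Uncertainty-weighted optimization: mean minus lam * variance.

    `lam` is a hypothetical penalty weight chosen for illustration.
    """
    return fmean(rewards) - lam * pvariance(rewards)


# Hypothetical scores from a 3-member ensemble for one (prompt, answer) pair.
ensemble_scores = [1.0, 2.0, 3.0]
mean_score = mean_reward(ensemble_scores)
wco_score = wco_reward(ensemble_scores)
uwo_score = uwo_reward(ensemble_scores, lam=0.5)
```

Both objectives return a lower score than the plain mean whenever the ensemble members disagree, which is what makes them conservative.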

Some key terms


  • Overoptimization: a phenomenon in which policy optimization appears to make progress according to the learned reward model but in reality begins to regress with respect to the true reward function

Label noise

  • In real-world RLHF setups, agreement rates among human annotators are typically between 60% and 75% (Ziegler et al., 2019; Stiennon et al., 2020; Dubois et al., 2023).
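
One simple way to simulate this kind of annotator disagreement is to flip each binary preference label independently with a fixed probability. The sketch below assumes that reading of "25% label noise"; the helper name `add_label_noise`, the 0/1 label encoding, and the seed are illustrative choices, not details taken from the paper.

```python
import random


def add_label_noise(labels, noise_rate=0.25, seed=0):
    """Flip each 0/1 preference label independently with probability noise_rate."""
    rng = random.Random(seed)  # local generator so results are reproducible
    return [1 - y if rng.random() < noise_rate else y for y in labels]


# Hypothetical clean labels: 1 means the first response was preferred.
clean_labels = [1, 0, 1, 1, 0, 0, 1, 0]
noisy_labels = add_label_noise(clean_labels, noise_rate=0.25)
```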

Best of N Sampling
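
As a rough sketch of the general idea, best-of-n sampling draws n candidate answers and keeps the one that the reward signal scores highest. The example below assumes an ensemble whose scores are aggregated with a conservative rule (here `min`, i.e. worst-case) before selection. The toy reward models and candidate strings are hypothetical stand-ins, not anything from the paper.

```python
def best_of_n(candidates, ensemble, aggregate=min):
    """Return the candidate with the highest aggregated ensemble reward."""
    return max(candidates, key=lambda c: aggregate([rm(c) for rm in ensemble]))


# Toy stand-ins for learned reward models (assumptions for illustration).
def length_rm(text):
    return float(len(text))  # exploitable: longer is always "better"


def moderate_rm(text):
    return 2.5 - abs(len(text) - 2)  # prefers answers of length 2


candidates = ["a", "bb", "ccc"]
single_choice = best_of_n(candidates, [length_rm])
ensemble_choice = best_of_n(candidates, [length_rm, moderate_rm])
```

In this toy setup the single reward model picks the longest string, while the worst-case ensemble picks the candidate that both models rate reasonably well.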


Method: reward model ensemble





For PPO with a small KL penalty coefficient ($\beta = 0.01$), WCO and UWO both successfully prevent overoptimization.
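
For context, the KL penalty in RLHF-style PPO is commonly applied by subtracting $\beta$ times the log-probability ratio between the current policy and the initial policy from the reward. The sketch below assumes that standard form. The function name and the numeric inputs are hypothetical, and only the default `beta=0.01` comes from the note above.

```python
def kl_penalized_reward(reward, logp_policy, logp_init, beta=0.01):
    """Reward minus beta * log(pi_policy(a|q) / pi_init(a|q))."""
    return reward - beta * (logp_policy - logp_init)


# Hypothetical values: the policy assigns a higher log-probability to the
# sampled answer than the initial model does, so a small penalty applies.
penalized = kl_penalized_reward(reward=1.0, logp_policy=-2.0, logp_init=-3.0)
```

With such a small coefficient the penalty barely constrains the policy, which is why the conservative ensemble objective has to do most of the work against overoptimization.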


Use an ensemble of reward models