[TOC]

  1. Title: Reward Model Ensembles Help Mitigate Overoptimization
  2. Author: Thomas Coste et al.
  3. Publish Date: 10 Mar 2024
  4. Review Date: Thu, May 9, 2024
  5. url: arXiv:2310.02743v2

Summary of paper

Motivation

Contribution

Some key terms

Overoptimization

Label noise

Best of N Sampling

Method: reward model ensemble

Results

For PPO, with a small KL penalty coefficient ($\beta = 0.01$), worst-case optimization (WCO) and uncertainty-weighted optimization (UWO) both successfully prevent overoptimization.
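
To make the two conservative objectives concrete, here is a minimal sketch of how an ensemble's scores for one answer could be combined. It assumes WCO takes the minimum reward across members and UWO subtracts a variance penalty from the mean. The function names, the weight `lam`, the use of population variance, the per-sample KL estimate, and the example scores are all illustrative assumptions, not the paper's code.

```python
from statistics import fmean, pvariance


def mean_objective(rewards):
    """Average the ensemble members' rewards (baseline, not conservative)."""
    return fmean(rewards)


def wco(rewards):
    """Worst-case optimization: keep only the most pessimistic member's reward."""
    return min(rewards)


def uwo(rewards, lam=0.5):
    """Uncertainty-weighted optimization: mean minus lam times the variance.

    lam=0.5 is a hypothetical default; population variance is an assumption.
    """
    return fmean(rewards) - lam * pvariance(rewards)


def kl_penalized(reward, log_ratio, beta=0.01):
    """Subtract beta times a per-sample KL estimate (assumed to be
    log pi(a|q) - log pi_init(a|q)) from the combined reward."""
    return reward - beta * log_ratio


# Hypothetical scores from a 3-member ensemble for two candidate answers.
agree = [2.0, 2.0, 2.0]      # members agree
disagree = [0.0, 2.0, 7.0]   # higher mean, but members disagree

for name, objective in [("mean", mean_objective), ("wco", wco), ("uwo", uwo)]:
    print(name, objective(agree), objective(disagree))
```

With these made-up scores, the mean objective prefers the second answer, while WCO and UWO both prefer the first, on which the ensemble members agree. That is the intended effect: answers that only some reward models like get discounted.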

Summary

Use an ensemble of reward models, combined through a conservative objective such as WCO or UWO, to mitigate overoptimization.