[TOC]

  1. Title: Failure Modes of Learning Reward Models for LLMs and other Sequence Models
  2. Author: Silviu Pitis
  3. Publish Year: 2023 (ICML workshop)
  4. Review Date: Fri, May 10, 2024
  5. url: https://openreview.net/forum?id=NjOoxFRZA4&noteId=niZsZfTPPt

Summary of paper


C3. Preferences cannot be represented as numbers

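One standard way to see this (my own minimal sketch, not necessarily the paper's argument): on a finite set of responses, a strict preference relation can be matched by a scalar reward only if it contains no cycles. Cyclic (intransitive) preferences, which do arise when judgments are aggregated across people or contexts, admit no consistent numeric reward:

```python
# Sketch: a cyclic (intransitive) preference relation cannot be represented by any
# scalar reward r with  a preferred to b  =>  r(a) > r(b).
# The items A/B/C and the preferences are hypothetical, for illustration only.

def admits_scalar_reward(items, strict_prefs):
    """Return True iff the strict-preference digraph is acyclic, i.e. some
    scalar reward is consistent with every stated preference."""
    graph = {x: [] for x in items}
    for winner, loser in strict_prefs:
        graph[winner].append(loser)

    WHITE, GRAY, BLACK = 0, 1, 2
    color = {x: WHITE for x in items}

    def has_cycle(node):
        color[node] = GRAY
        for nxt in graph[node]:
            if color[nxt] == GRAY or (color[nxt] == WHITE and has_cycle(nxt)):
                return True
        color[node] = BLACK
        return False

    return not any(color[x] == WHITE and has_cycle(x) for x in items)

# Transitive preferences: a reward exists, e.g. r(A)=2, r(B)=1, r(C)=0.
print(admits_scalar_reward("ABC", [("A", "B"), ("B", "C"), ("A", "C")]))  # True

# Cyclic preferences A > B > C > A: no assignment of numbers can satisfy all three.
print(admits_scalar_reward("ABC", [("A", "B"), ("B", "C"), ("C", "A")]))  # False
```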

M1. Rationality level of human preferences

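The figure itself is not reproduced here, but "rationality level" is commonly encoded as a Boltzmann-rationality coefficient β inside the Bradley-Terry preference probability. A minimal sketch with an assumed reward gap (my numbers, not the paper's):

```python
import math

def pref_prob(r_a, r_b, beta):
    """Bradley-Terry / Boltzmann-rational preference probability:
    P(A preferred over B) = sigmoid(beta * (r(A) - r(B)))."""
    return 1.0 / (1.0 + math.exp(-beta * (r_a - r_b)))

# Assumed reward gap of 1.0 between two responses (hypothetical numbers).
for beta in (0.0, 0.5, 2.0, 10.0):
    print(f"beta={beta:>4}: P(A > B) = {pref_prob(1.0, 0.0, beta):.3f}")
# beta = 0   -> 0.500: labels are pure noise
# large beta -> ~1.0:  the labeller is effectively perfectly rational
```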

3.2. If the condition/context changes, preferences may change sharply, and a single reward model cannot reflect this.

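A small numerical sketch of this point (mine, with hypothetical labels): if the same pair of responses is preferred one way in context 1 and the opposite way in context 2, the best a context-blind reward over responses can do is score them equally, predicting P ≈ 0.5 in both contexts, whereas a context-aware (or policy-level) model can fit both labels:

```python
import math

def bt_prob(gap):
    """Bradley-Terry probability that response A beats B, given reward gap r(A) - r(B)."""
    return 1.0 / (1.0 + math.exp(-gap))

def log_likelihood(gap, labels):
    """Log-likelihood of observed labels (True = A preferred) under a single reward gap."""
    return sum(math.log(bt_prob(gap) if a_wins else 1.0 - bt_prob(gap)) for a_wins in labels)

# Hypothetical data: context 1 prefers A over B, context 2 prefers B over A.
labels = [True, False]

# A context-blind reward has one gap to explain both labels; a coarse scan shows
# the likelihood peaks at gap = 0, i.e. P(A > B) = 0.5 in every context.
best_gap = max((g / 10 for g in range(-50, 51)), key=lambda g: log_likelihood(g, labels))
print("best context-blind gap:", best_gap, "-> P(A > B) =", round(bt_prob(best_gap), 3))

# A context-aware reward (or a policy-level preference) can use a different gap
# per context and fit both labels almost perfectly.
print("context-aware fit:", round(bt_prob(5.0), 3), "and", round(1.0 - bt_prob(-5.0), 3))
```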

A2. Preferences should be expressed with respect to state-policy pairs, rather than just outcomes

Example with Texas Hold’em: the author uses a poker example to illustrate these concepts. A player holding a weaker hand (72o) wins against a stronger hand (AA) after both commit to large bets pre-flop. Traditional reward modeling would prefer the winning trajectory of the weaker hand because of its positive outcome. A rational analysis that ignores the realized stochastic outcome, however, would prefer the decision-making associated with the stronger hand (AA), even though it lost, since committing with AA is the better strategy.
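
A back-of-the-envelope version of this example (my own numbers: the pre-flop equity of AA against 72o is roughly 85-88%, and the stack sizes are made up):

```python
# Assumed all-in equities and stakes (illustrative numbers, not from the paper).
P_AA_WINS = 0.87      # approximate pre-flop equity of AA against 72o
STAKE = 200           # chips each player commits by going all-in

def ev_of_allin(win_prob):
    """Expected chip change of committing STAKE with the given win probability."""
    return win_prob * STAKE - (1 - win_prob) * STAKE

ev_aa = ev_of_allin(P_AA_WINS)        # decision made holding AA
ev_72o = ev_of_allin(1 - P_AA_WINS)   # decision made holding 72o

print(f"EV(all-in with AA)  = {ev_aa:+.1f} chips")   # +148.0
print(f"EV(all-in with 72o) = {ev_72o:+.1f} chips")  # -148.0

# Outcome-based reward modeling scores the realized trajectory (72o won +200 here),
# so it prefers the 72o trajectory. A preference over state-policy pairs compares
# the expected values above and prefers the AA decision, however this one hand ran out.
```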

Preference Ambiguity in LLM Tool Usage: similar ambiguities arise when these concepts are applied to large language models (LLMs). Should a reward model prefer trajectories in which a risky action happened to lead to a correct outcome, or should it also account for what might have gone wrong (the counterfactuals)?
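
A toy simulation of this ambiguity (assumed success probabilities and a hypothetical "risky" vs "safe" tool call, not from the paper): outcome-based labels taken from single sampled trajectories can systematically prefer the riskier policy whenever it happens to get lucky, while counterfactual-aware labels compare expected outcomes:

```python
import random

random.seed(0)

# Hypothetical tool-calling policies (numbers are assumptions for illustration):
# the risky call is only right 40% of the time, the safe call 85% of the time.
def risky_rollout():
    return 1.0 if random.random() < 0.40 else 0.0

def safe_rollout():
    return 1.0 if random.random() < 0.85 else 0.0

N = 10_000

# Counterfactual-aware (policy-level) comparison: expected outcome of each policy.
risky_mean = sum(risky_rollout() for _ in range(N)) / N
safe_mean = sum(safe_rollout() for _ in range(N)) / N
print("expected outcomes:", round(risky_mean, 3), "(risky) vs", round(safe_mean, 3), "(safe)")

# Outcome-based comparison: label preferences from one sampled trajectory per policy.
strict, prefers_risky = 0, 0
for _ in range(N):
    r, s = risky_rollout(), safe_rollout()
    if r != s:           # a strict outcome-based preference exists
        strict += 1
        if r > s:        # the risky call got lucky and is preferred
            prefers_risky += 1

print(f"of trajectories with a strict outcome-based label, "
      f"{prefers_risky / strict:.0%} prefer the risky call "
      f"despite its lower expected outcome")
```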

3.3. Reward misgeneralization