[TOC]

  1. Title: Internally Rewarded Reinforcement Learning
  2. Author: Mengdi Li et al.
  3. Publish Year: 2023, PMLR v202 (ICML 2023)
  4. Review Date: Wed, May 8, 2024
  5. url: https://proceedings.mlr.press/v202/li23ax.html

Summary of paper

Motivation

Contribution

Some key terms

Simultaneous optimization causes suboptimal training

AIM

Define discriminator

$\tau \in (\mathcal S \times \mathcal A)^n$ ($n \in \mathbb N$ is the trajectory length)

The discriminator $q_\phi(y \mid \tau)$ computes the probability that label $y$ is the cause of trajectory $\tau$.
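
To make this concrete, here is a minimal (hypothetical) PyTorch sketch of such a discriminator, encoding the $(s, a)$ sequence with a GRU and returning $\log q_\phi(y \mid \tau)$; the architecture, dimensions, and names are my own assumptions, not taken from the paper:

```python
import torch
import torch.nn as nn

class TrajectoryDiscriminator(nn.Module):
    """q_phi(y | tau): maps a trajectory of (state, action) pairs to a
    distribution over labels y. Illustrative only; the actual tasks would
    use task-specific encoders."""

    def __init__(self, state_dim: int, action_dim: int, num_labels: int, hidden: int = 128):
        super().__init__()
        self.encoder = nn.GRU(state_dim + action_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_labels)

    def forward(self, states: torch.Tensor, actions: torch.Tensor) -> torch.Tensor:
        # states: (batch, n, state_dim), actions: (batch, n, action_dim)
        x = torch.cat([states, actions], dim=-1)
        _, h = self.encoder(x)                    # h: (1, batch, hidden)
        logits = self.head(h.squeeze(0))          # (batch, num_labels)
        return torch.log_softmax(logits, dim=-1)  # log q_phi(y | tau)
```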

The hard attention example is one instance of IRRL

Mutual Information maximization

From the DEIR paper: the intrinsic reward there is likewise derived from a mutual-information objective.

From this paper: the objective is the mutual information between the trajectory $\tau$ and the label $y$. Here, $p(y \mid \tau)$ is the oracle posterior probability, which reflects how much information the observation ($\tau$) carries for discrimination. It can be interpreted as being produced by an oracle discriminator, a conceptual construct used only for the theoretical formulation.
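
For reference, the underlying identity this refers to (the standard mutual-information decomposition, written in the note's notation rather than copied from the paper):

$$
I(\tau; y) \;=\; \mathbb{E}_{p(\tau, y)}\!\left[\log \frac{p(y \mid \tau)}{p(y)}\right] \;=\; \mathcal{H}(y) - \mathcal{H}(y \mid \tau)
$$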

Extend this mutual information to reward design and policy training

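My reading of how the extension works (the exact equations are in the paper; this is the standard variational bound, with $q_\phi$ replacing the inaccessible oracle posterior):

$$
I(\tau; y) \;\ge\; \mathcal{H}(y) + \mathbb{E}_{\tau \sim \pi_{\theta},\, y \sim p(y)}\!\left[\log q_{\phi}(y \mid \tau)\right],
$$

so, up to the constant $\mathcal{H}(y)$, the policy can be trained with an intrinsic reward built from $\log q_\phi(y \mid \tau)$, while the discriminator is trained to tighten the bound.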

discriminator training

The standard cross-entropy objective for training a classifier would be to maximize
$$
\mathbb{E}_{\tau \sim \pi_{\theta},\, y \sim p(y)} \left[ p(y \mid \tau) \log q_{\phi}(y \mid \tau) \right],
$$
but we drop the oracle weight $p(y \mid \tau)$ by assuming it to be 1, which leaves the standard maximum-likelihood term $\log q_{\phi}(y \mid \tau)$.

why:

Reward hacking: this corresponds to the current LRM setting, where the language reward model is trained beforehand and then kept fixed during policy training

see the paper for more details
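
A minimal sketch of the resulting joint update, assuming a discrete label set, the simplified cross-entropy loss above, and a log-probability intrinsic reward; `discriminator` is assumed to return $\log q_\phi(\cdot \mid \tau)$ as in the sketch under "Define discriminator", and the rest is my own illustrative scaffolding, not the paper's code:

```python
import torch
import torch.nn.functional as F

def joint_update(discriminator, disc_optimizer, states, actions, y):
    """One step of the joint scheme for a single rollout (states, actions)
    collected under label y (LongTensor of shape (1,)).

    The policy's intrinsic reward and the discriminator's loss both come from
    the same, simultaneously trained discriminator -- there is no pre-trained,
    frozen reward model."""
    # Intrinsic reward for the policy: the log-probability the discriminator
    # assigns to the true label (one choice of increasing function g).
    with torch.no_grad():
        reward = discriminator(states, actions)[0, y.item()].item()

    # Discriminator update: plain cross-entropy / maximum likelihood on
    # trajectories generated by the current policy (oracle weight dropped).
    log_q = discriminator(states, actions)   # (1, num_labels) log-probabilities
    loss = F.nll_loss(log_q, y)              # NLL on log-probs == cross-entropy
    disc_optimizer.zero_grad()
    loss.backward()
    disc_optimizer.step()

    return reward  # to be fed into a standard RL update of the policy (e.g. REINFORCE)
```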

Generalized Reward and increasing function

INTUITION: if we make $g$ linear, the higher-order ($n \ge 2$) derivative terms in the Taylor approximation of the reward noise become 0.

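Spelling that out (my paraphrase): write the discriminator's estimate as $\hat p = q_\phi(y \mid \tau) = p(y \mid \tau) + \epsilon$, where $\epsilon$ is the estimation error. The noise of a reward $g(\hat p)$ relative to the oracle reward $g(p)$ is then

$$
g(\hat p) - g(p) \;=\; g'(p)\,\epsilon + \frac{g''(p)}{2!}\,\epsilon^{2} + \frac{g'''(p)}{3!}\,\epsilon^{3} + \cdots
$$

If $g$ is linear, every term of order $n \ge 2$ vanishes, so the noise is just a constant times $\epsilon$ and stays bounded whenever the estimation error is bounded; for $g = \log$, the $n$-th derivative scales like $1/p^{n}$, so the noise blows up when $p(y \mid \tau)$ is small.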

CLIPPED

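The clipped linear form noted at the end of this review (with the learned $q_\phi$ standing in for the oracle posterior; see the paper for the exact definition) would be

$$
r_{\text{clip}}(\tau, y) \;=\; \max\!\big(q_\phi(y \mid \tau) - p(y),\; 0\big),
$$

i.e., the reward is zero until the discriminator beats the label prior $p(y)$.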

Results

Summary

Potential future work

Go check our theory to see whether the reward function can be made linear there.

Not really applicable to reward signals that do not use the "log" but the pure $p(y \mid \tau)$.

But it contains a further $p(y)$ term, so maybe we can compare this clipped reward signal $\max(p(y \mid \tau) - p(y), 0)$ with the pure $p(y \mid \tau)$.