[TOC]

  1. Title: Internally Rewarded Reinforcement Learning
  2. Author: Mengdi Li et al.
  3. Publish Year: 2023 PMLR
  4. Review Date: Wed, May 8, 2024
  5. url: https://proceedings.mlr.press/v202/li23ax.html

Summary of paper

image-20240508150740997

Motivation

  • the authors study a class of RL problems where the reward signals for policy learning are generated by a discriminator that is dependent on, and jointly optimized with, the policy (the policy and the reward model are trained in parallel)
  • this leads to an unstable learning process: reward signals from an immature discriminator are noisy and impede policy learning, and conversely, an under-optimized policy impedes discriminator learning
  • this learning setting is called Internally Rewarded RL (IRRL) because the reward is not provided directly by the environment but internally by the discriminator; a schematic of the resulting training loop is sketched below.
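
A schematic of this joint training loop, as I understand it (all names below are illustrative placeholders, not the paper's code):

```python
from typing import Any, Callable

def irrl_training_loop(
    sample_goal: Callable[[], int],          # y ~ p(y)
    rollout: Callable[[Any, int], Any],      # collect trajectory tau with the current policy
    policy: Any,                             # RL policy pi_theta (any RL algorithm)
    discriminator: Any,                      # q_phi(y | tau)
    reward_fn: Callable[[float], float],     # maps q_phi(y | tau) to a scalar reward
    num_iterations: int,
) -> None:
    """Schematic IRRL loop: the policy's reward is produced internally by a
    discriminator that is itself trained on the policy's trajectories."""
    for _ in range(num_iterations):
        y = sample_goal()
        tau = rollout(policy, y)
        posterior = discriminator.predict(tau)   # q_phi(. | tau), noisy early in training
        r = reward_fn(posterior[y])              # internal reward (e.g. log- or clipped-linear form)
        policy.update(tau, r)                    # policy step with the (noisy) internal reward
        discriminator.update(tau, y)             # discriminator step on the same trajectory
```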

Contribution

  • proposed a clipped linear reward function; results show that it consistently stabilizes the training process by reducing the impact of reward noise, leading to faster convergence and higher performance
  • formulated a class of RL problems as IRRL, and formalized the inherent issue of noisy rewards that leads to an unstable training loop in IRRL
  • empirically characterized the noise in the discriminator and derived the effect of the reward function in reducing the bias of the estimated reward and the variance of reward noise from an underdeveloped discriminator
    • Comment: the authors express the bias and variance of the reward noise via a Taylor approximation
  • proposed the clipped linear reward function

Some key terms

Simultaneous optimization causes suboptimal training

image-20240508190033406

AIM

  • in this work, we seek to solve this issue by reducing the impact of reward noise, which is challenging due to the unavailability of an oracle discriminator whose posterior probability can reflect the information sufficiency for discrimination.

Define discriminator

A trajectory is $\tau \in (\mathcal S \times \mathcal A)^n$, where $n \in \mathbb N$ is the trajectory length.

the discriminator $q_\phi(y \mid \tau)$ computes the probability of label $y$ being the cause of trajectory $\tau$.

  • because $y$ is not accessible to the agent during policy training, we instead measure the alignment of the goal $y$ with the collected trajectory $\tau$ (a minimal sketch of such a discriminator is below).
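
A minimal sketch of what such a discriminator could look like (the GRU encoder and all names are my own illustrative choices, not taken from the paper):

```python
import torch
import torch.nn as nn

class TrajectoryDiscriminator(nn.Module):
    """Sketch of q_phi(y | tau): encodes a trajectory of (state, action) features
    and outputs scores over labels y. The GRU encoder is an illustrative choice,
    not necessarily what the paper uses."""

    def __init__(self, feat_dim: int, hidden_dim: int, num_labels: int):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_labels)

    def forward(self, tau: torch.Tensor) -> torch.Tensor:
        # tau: (batch, n, feat_dim) -- a length-n trajectory of state-action features
        _, h = self.encoder(tau)            # h: (1, batch, hidden_dim)
        return self.head(h.squeeze(0))      # label logits; softmax(-1) gives q_phi(y | tau)
```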

Hard Attention example is one instance of IRRL

image-20240508214452118

  • Comment: I think Actor-Critic is somewhat similar to this scenario, but the authors did not mention it.

Mutual Information maximization

from the DEIR paper:

image-20240508223518590

image-20240508223746437 image-20240508224624531

from this paper:

image-20240508224652788

in this equation, $p(y\mid \tau)$ is the oracle posterior probability that reflects the information sufficiency of observation ($\tau$) for discrimination. It can be interpreted as being generated by an oracle discriminator, a conceptual term utilized for the theoretical formulation.
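
Written out in this note's notation, the mutual-information objective and its standard variational lower bound (the Barber–Agakov bound) are:

$$
\begin{aligned}
I(\tau; y) &= \mathbb{E}_{\tau \sim \pi_\theta,\, y \sim p(y)}\big[\log p(y \mid \tau) - \log p(y)\big] \\
           &\ge \mathbb{E}_{\tau \sim \pi_\theta,\, y \sim p(y)}\big[\log q_\phi(y \mid \tau) - \log p(y)\big],
\end{aligned}
$$

where the gap is the expected KL divergence between the oracle posterior $p(y \mid \tau)$ and the discriminator $q_\phi(y \mid \tau)$. A natural per-trajectory reward is then $r(\tau, y) = \log q_\phi(y \mid \tau) - \log p(y)$, presumably the log-form reward that the generalized $g(\cdot)$ in the later section replaces.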

Extending this mutual information objective to reward design and policy training

image-20240508230250906

discriminator training

The standard cross-entropy loss for training a classifier would be (up to sign)

$$
\mathbb{E}_{\tau \sim \pi_{\theta},\, y \sim p(y)} \left[ p(y \mid \tau) \log q_{\phi}(y \mid \tau) \right],
$$

but $p(y \mid \tau)$ is dropped by assuming it to be 1; a sketch of the resulting simplified loss follows the list below.

why:

  • The simplification to drop $ p(y \mid \tau) $ implicitly assumes that for every trajectory $ \tau $, there is a direct, deterministic relationship to a label $ y $. This is equivalent to assuming $ p(y \mid \tau) = 1 $ for the sampled $ y $ and $ \tau $. In practice, this means assuming that the trajectory $ \tau $ contains all necessary information to unequivocally determine $ y $.
  • in other words, it assumes we do not care about the uncertainty in $p(y \mid \tau)$
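
Under that assumption the weighted cross-entropy collapses to the ordinary negative log-likelihood of the sampled label; a minimal PyTorch sketch (function name and signature are mine):

```python
import torch
import torch.nn.functional as F

def discriminator_loss(logits: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Simplified discriminator objective: with p(y | tau) assumed to be 1 for the
    sampled (tau, y) pair, the weighted cross-entropy above reduces to the standard
    negative log-likelihood -log q_phi(y | tau)."""
    # logits: (batch, num_labels) from the discriminator; y: (batch,) sampled labels
    return F.cross_entropy(logits, y)
```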

Reward hacking = the current LRM setting, where the language reward model is trained beforehand and then kept fixed during policy training

see the paper for more details

Generalized Reward and increasing function

image-20240508233726517

image-20240508233741733

INTUITION: if we make $g$ linear, the higher-order ($n \ge 2$) derivative terms in the Taylor approximation of the reward noise become 0.

image-20240508233931313
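
Spelled out (my own paraphrase of the argument above): write the discriminator output as the oracle posterior plus noise, $q_\phi(y \mid \tau) = p(y \mid \tau) + \epsilon$, and Taylor-expand the reward $r = g\big(q_\phi(y \mid \tau)\big)$ around $p(y \mid \tau)$:

$$
\mathbb{E}_{\epsilon}\Big[g\big(p(y \mid \tau) + \epsilon\big)\Big]
\approx g\big(p(y \mid \tau)\big)
+ g'\big(p(y \mid \tau)\big)\,\mathbb{E}[\epsilon]
+ \tfrac{1}{2}\,g''\big(p(y \mid \tau)\big)\,\mathbb{E}\big[\epsilon^2\big]
+ \cdots
$$

With a linear $g$, every term involving $g^{(n)}$ for $n \ge 2$ vanishes, so the reward bias reduces to $g'\,\mathbb{E}[\epsilon]$ and the noise variance is only scaled by the constant $(g')^2$; with $g = \log$, by contrast, the derivatives blow up as the posterior approaches 0, amplifying the noise of an immature discriminator.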

CLIPPED

image-20240508234548316
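
A tiny numerical contrast between the log-form reward and the clipped linear reward, assuming the $\max(q_\phi(y \mid \tau) - p(y),\ 0)$ form that this note writes down under “Potential future work” (the paper's exact parameterization may differ):

```python
import numpy as np

def log_reward(q_y: np.ndarray, p_y: float) -> np.ndarray:
    """MI-style reward: log q_phi(y | tau) - log p(y)."""
    return np.log(q_y) - np.log(p_y)

def clipped_linear_reward(q_y: np.ndarray, p_y: float) -> np.ndarray:
    """Clipped linear reward: max(q_phi(y | tau) - p(y), 0).
    Linear in q, so second- and higher-order Taylor terms of the noise vanish;
    the clip keeps the reward non-negative when the discriminator is no better
    than the prior."""
    return np.maximum(q_y - p_y, 0.0)

# Uniform prior over 10 labels, so p(y) = 0.1.
q = np.array([0.05, 0.10, 0.50, 0.95])
print(log_reward(q, 0.1))             # ~[-0.69  0.    1.61  2.25] -- unbounded below as q -> 0
print(clipped_linear_reward(q, 0.1))  #  [ 0.    0.    0.4   0.85] -- bounded, linear above the prior
```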

Results

image-20240508234604709

Summary

  • think about the reward noise, write it out with a Taylor approximation, and then make the reward function linear to reduce the bias and variance introduced by the noise!

Potential future work

go check our own theory to see whether we can make the reward function linear?

not directly applicable to reward signals that do not use the “log” form but the pure $p(y\mid \tau)$

but it contains a further $p(y)$ term, so maybe we can compare this clipped reward signal $\max(p(y \mid \tau) - p(y), 0)$ with the pure $p(y \mid \tau)$