[TOC]

  1. Title: Reinforcement Learning With a Corrupted Reward Channel
  2. Author: Tom Everitt
  3. Publish Year: August 22, 2017
  4. Review Date: Mon, Dec 26, 2022

Summary of paper

Motivation

Contribution

Limitation

Some key terms

Inverse Reinforcement learning

true reward and (possibly corrupt) observed reward

Boat racing game example

  1. In the boat racing game, the true reward may be a function of the agent’s final position in the race or of the time it takes to complete the race, depending on the designers’ intentions. The reward corruption function $C$ increases the observed reward on the loop the agent found (see the toy sketch below).
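
To make the corruption concrete, here is a minimal toy sketch (my own illustration, not code from the paper). The state names and reward values are invented, and `corruption_C` stands in for the paper's corruption function $C$; the agent only ever sees the observed reward, never the true one.

```python
# Toy corrupted reward channel for the boat-race example (hypothetical numbers).
# The agent observes C(state, true_reward) and never sees true_reward directly.

TRUE_REWARD = {
    "finish_line": 10.0,  # what the designers actually care about
    "loop":         0.0,  # circling the loop earns no true reward
    "open_water":   0.0,
}

def corruption_C(state, true_reward):
    """Reward corruption function C: inflates the observed reward on the loop."""
    if state == "loop":
        return true_reward + 5.0  # corrupted observation
    return true_reward            # elsewhere the observed reward equals the true reward

def observed_reward(state):
    return corruption_C(state, TRUE_REWARD[state])

if __name__ == "__main__":
    for s in TRUE_REWARD:
        print(f"{s:12s} true={TRUE_REWARD[s]:5.1f} observed={observed_reward(s):5.1f}")
```

A greedy agent maximising the observed reward ends up circling the loop, even though the true reward there is zero.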

worst case regret
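
A hedged sketch of what worst-case regret means in this setting (my notation; the paper's exact formulation may differ). Writing $G_T(\mu, \pi)$ for the expected cumulative *true* reward of policy $\pi$ in CRMDP $\mu$ over $T$ steps, regret compares $\pi$ to the best policy in hindsight, and the worst case is taken over a class $\mathcal{M}$ of environments consistent with the agent’s assumptions:

$$
\mathrm{Reg}(\mu, \pi, T) = \max_{\pi'} G_T(\mu, \pi') - G_T(\mu, \pi), \qquad \mathrm{Reg}(\mathcal{M}, \pi, T) = \max_{\mu \in \mathcal{M}} \mathrm{Reg}(\mu, \pi, T)
$$

The key point is that regret is measured in true reward, even though the agent only observes the (possibly corrupt) reward.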

No Free Lunch Theorem

limited reward corruption assumption

Easy environment assumption

Major comments

Solution

  1. agents drawing from multiple sources of evidence are likely to be the safest, as they will most easily satisfy the conditions of Theorems 19 and 20. For example, humans simultaneously learn their values from pleasure/pain stimuli, watching other people act, listening to stories, as well as (parental) evaluation of different scenarios. Combining sources of evidence may also go some way towards managing reward corruption beyond sensory corruption (see the cross-checking sketch after this list).
  2. randomness increases robustness: not all contexts allow the agent to get sufficiently rich data to overcome the reward corruption problem.
    1. the problem was that they got stuck on a particular value $\hat{r}^*$ of the observed reward. If unlucky, $\hat{r}^*$ was available in a corrupt state, in which case the CR agent may get no true reward. In other words, there were adversarial inputs where the CR agent performed poorly.
    2. a common way to protect against adversarial inputs is to use a randomised algorithm. Applied to RL and CRMDPs, this idea leads to quantilising agents: instead of choosing the state that maximises observed reward, these agents randomly choose a state from a top quantile of high-reward states (see the quantilisation sketch after this list).
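
To make item 1 concrete, here is a toy cross-checking sketch of my own (not an agent analysed in the paper). The channel readings are hypothetical, and the median is just one simple robust aggregate that a single corrupted channel cannot drag arbitrarily far:

```python
import statistics

def cross_checked_reward(channel_readings):
    """Aggregate several independent reward estimates for one state.

    With a median, one corrupted channel cannot move the estimate far
    from what the uncorrupted channels agree on (toy illustration only).
    """
    return statistics.median(channel_readings)

# Hypothetical readings for one state: e.g. a pleasure/pain signal, an estimate
# from demonstrations, and a stated evaluation, with one channel corrupted upwards.
readings = [1.0, 1.2, 9.7]
print(cross_checked_reward(readings))  # -> 1.2; the corrupt 9.7 is outvoted
```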
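
And a minimal sketch of the quantilising idea from item 2 (the parameter `delta`, the uniform choice over the top quantile, and the toy rewards are my assumptions about the general technique, not the paper's exact construction):

```python
import math
import random

def quantilise(observed_rewards, delta=0.1, rng=random):
    """Choose a state uniformly at random from the top delta-quantile of
    observed reward, instead of always taking the argmax.

    observed_rewards: dict mapping state -> (possibly corrupt) observed reward.
    """
    ranked = sorted(observed_rewards, key=observed_rewards.get, reverse=True)
    k = max(1, math.ceil(delta * len(ranked)))  # size of the top quantile
    return rng.choice(ranked[:k])

# Toy usage: one corrupt state advertises a huge observed reward, but with
# delta = 0.3 it is chosen only about a third of the time, not always.
rewards = {f"s{i}": r for i, r in enumerate([9.9, 3.0, 2.9, 2.5, 1.0, 0.2, 0.1, 0.0, 0.0, 0.0])}
print(quantilise(rewards, delta=0.3, rng=random.Random(0)))
```

The design point is that over-optimisation of the observed reward is capped: a single adversarially inflated state attracts at most a $1/k$ share of the agent's choices rather than all of them.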

takeaways

  1. without simplifying assumptions, no agent can avoid the corrupted reward problem.
  2. Using the reward signal as evidence rather than as an optimisation target is no magic bullet, even under strong simplifying assumptions. Essentially, this is because the agent does not know the exact relation between the observed reward and the true reward. However, when the data enables sufficient cross-checking of rewards, agents can avoid the corrupt reward problem. Combining frameworks and providing the agent with different sources of data may often be the safest option. In other words, we need several different reward-signal sources to alleviate the corruption.
  3. in cases where sufficient cross-checking of rewards is not possible, quantilisation may improve robustness. Essentially, quantilisation prevents agents from over-optimising their objectives.

Potential future work