[TOC]

  1. Title: Reward learning from human preferences and demonstrations in Atari
  2. Author: Borja Ibarz et al.
  3. Publish Year: 2018
  4. Review Date: Nov 2021

Summary of paper

The authors propose a method that uses human experts' annotations, rather than the extrinsic reward from the environment, to guide reinforcement learning.

Figure: the proposed training algorithm

Two feedback channels are provided (a sketch of how they fit into the training loop follows the list):

  1. Demonstrations: several trajectories of human behaviour on the task
  2. Preferences: the human compares pairs of short trajectory segments of the agent’s behaviour and prefers the one that is closer to the intended goal.
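A minimal sketch of how these two channels fit into the joint training loop from the figure above; all function names here are hypothetical placeholders, and the actual training procedures are taken as callables rather than implemented:

```python
def train_from_human_feedback(collect_episodes, query_human_preferences,
                              fit_reward_model, train_policy_dqfd,
                              demonstrations, num_iterations):
    """Hypothetical outline of the joint training loop: the reward model is
    fit to human preference labels, and the policy is trained DQfD-style on
    rewards relabelled by that model instead of the game score."""
    replay = list(demonstrations)        # demonstrations seed the replay buffer
    preferences = []                     # labelled pairs of trajectory segments
    policy, reward_model = None, None

    for _ in range(num_iterations):
        # Agent behaviour under the current policy, added to the replay buffer
        episodes = collect_episodes(policy)
        replay.extend(episodes)

        # The human compares pairs of short clips drawn from agent behaviour
        # (and demonstrations) and labels the one closer to the intended goal
        preferences.extend(query_human_preferences(episodes, demonstrations))

        # Reward model: cross-entropy on predicted preference probabilities
        reward_model = fit_reward_model(preferences)

        # Policy: DQfD-style objective with rewards from the reward model
        policy = train_policy_dqfd(replay, reward_model)

    return policy, reward_model
```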

The policy is trained with TD learning on the learned reward, following DQfD (deep Q-learning from demonstrations).

The training objective for the agent’s policy is the cost function $$ J(Q) = J_{PDDQn}(Q) + \lambda_2 J_E(Q) + \lambda_3 J_{L2}(Q) $$ where $J_{PDDQn}$ is the TD loss of n-step prioritized dueling double Q-learning (computed with the learned reward), $J_E$ is the supervised large-margin classification loss on demonstration transitions, and $J_{L2}$ is L2 regularisation on the network weights.
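A minimal numpy sketch of how these terms combine, with a plain one-step TD error standing in for the full prioritized dueling double n-step loss $J_{PDDQn}$; the function names, shapes, and coefficient values are illustrative, not taken from the paper:

```python
import numpy as np

def large_margin_loss(q_values, expert_actions, margin=0.8):
    """Supervised term J_E, applied to demonstration transitions: the expert
    action's Q-value must exceed every other action's by at least `margin`."""
    n = q_values.shape[0]
    margins = np.full_like(q_values, margin)
    margins[np.arange(n), expert_actions] = 0.0        # no margin on the expert action
    best = np.max(q_values + margins, axis=1)          # max_a [Q(s, a) + l(a_E, a)]
    expert_q = q_values[np.arange(n), expert_actions]  # Q(s, a_E)
    return np.mean(best - expert_q)

def td_loss(q_sa, r_hat, next_q_max, done, gamma=0.99):
    """One-step squared TD error, with the learned reward r_hat in place of
    the environment reward (a stand-in for the paper's J_PDDQn)."""
    target = r_hat + gamma * (1.0 - done) * next_q_max
    return np.mean((q_sa - target) ** 2)

def combined_loss(q_values, q_sa, expert_actions, r_hat, next_q_max, done,
                  weights, lambda_2=1.0, lambda_3=1e-5):
    """J(Q) = J_PDDQn(Q) + lambda_2 * J_E(Q) + lambda_3 * J_L2(Q)."""
    j_l2 = sum(np.sum(w ** 2) for w in weights)        # L2 weight regularisation
    return (td_loss(q_sa, r_hat, next_q_max, done)
            + lambda_2 * large_margin_loss(q_values, expert_actions)
            + lambda_3 * j_l2)
```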

When training the reward function, the model predicts the probability of preferring one segment $\sigma_1$ over another $\sigma_2$, and is trained by minimising the cross-entropy loss between this prediction and the judgement labels provided by the human experts.
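Concretely, the preference probability follows the softmax (Bradley-Terry style) model of Christiano et al. (2017), on which this work builds: the probability of preferring $\sigma_1$ is the softmax of the summed predicted rewards of the two segments. A small numpy sketch of this loss (names and shapes are illustrative):

```python
import numpy as np

def preference_loss(rewards_1, rewards_2, labels):
    """Cross-entropy loss for the reward model under the softmax preference
    model: P[sigma_1 preferred] = exp(sum r_1) / (exp(sum r_1) + exp(sum r_2)).

    rewards_1, rewards_2: predicted per-step rewards for the two segments,
        shape (batch, segment_length)
    labels: 1.0 if the human preferred segment 1, 0.0 if segment 2
        (0.5 can encode "equally preferred")
    """
    sum_1 = rewards_1.sum(axis=1)
    sum_2 = rewards_2.sum(axis=1)
    # log-probabilities, computed in a numerically stable way
    log_p1 = -np.logaddexp(0.0, sum_2 - sum_1)   # log P[segment 1 preferred]
    log_p2 = -np.logaddexp(0.0, sum_1 - sum_2)   # log P[segment 2 preferred]
    return -np.mean(labels * log_p1 + (1.0 - labels) * log_p2)
```

With a label of 1.0, this reduces to the negative log-probability that the reward model assigns to the segment the human preferred.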

The authors claim that this method outperforms imitation learning, in which the agent is trained using only the demonstrations from human experts.