[TOC]
- Title: Reward learning from human preferences and demonstrations in Atari
- Author: Borja Ibarz et al.
- Publish Year: 2018
- Review Date: Nov 2021
Summary of paper
The authors propose a method that uses human experts’ annotations, rather than the extrinsic reward from the environment, to guide reinforcement learning.
The proposed training algorithm
Two feedback channels are provided (a minimal sketch of these inputs follows the list):
- Demonstrations: several trajectories of human behaviour on the task
- Preferences: the human compares pairs of short trajectory segments of the agent’s behaviour and prefers those that are closer to the intended goal.
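A minimal sketch of how these two kinds of feedback might be represented; the class and field names (`Transition`, `PreferenceLabel`, `preference_for_a`, etc.) are illustrative assumptions, not the paper’s data format:

```python
from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class Transition:
    """One step of behaviour (field names are illustrative)."""
    observation: np.ndarray
    action: int
    next_observation: np.ndarray


# A demonstration is simply a full trajectory recorded from a human player.
Demonstration = List[Transition]


@dataclass
class PreferenceLabel:
    """A pairwise human judgement over two short clips of agent behaviour."""
    segment_a: List[Transition]
    segment_b: List[Transition]
    # 1.0 if the human prefers segment_a, 0.0 if segment_b,
    # 0.5 if the two clips are judged equally good.
    preference_for_a: float
```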
When training the policy, they minimize a TD error in which the rewards come from the learned reward model rather than from the environment.
The training objective for the agent’s policy is the cost function $$ J(Q) = J_{PDDQn}(Q) + \lambda_2 J_E(Q) + \lambda_3 J_{L2}(Q) $$
- the first term is the prioritized dueling double Q-loss,
- the second is the large-margin supervised loss, applied only to expert demonstrations,
- the third is the L2 regularization loss (a rough sketch of this combined objective follows below).
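A minimal PyTorch-style sketch of how these three terms could be combined; `q_net`, `target_q_net`, `reward_model`, the batch layout, and all hyperparameter values are assumptions for illustration, and the TD term is simplified to a one-step double-Q loss rather than the paper’s n-step prioritized dueling variant.

```python
import torch
import torch.nn.functional as F


def policy_loss(q_net, target_q_net, reward_model, batch,
                lambda_2=1.0, lambda_3=1e-5, margin=0.8, gamma=0.99):
    """Sketch of the combined cost J(Q); hyperparameters are illustrative."""
    obs, actions, next_obs, demo_actions, is_demo = batch

    # TD term: one-step double-Q loss using the *learned* reward
    # (the paper's version is n-step, prioritized and dueling).
    q_values = q_net(obs)                                    # (B, num_actions)
    q_taken = q_values.gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        rewards = reward_model(obs, actions)                 # predicted rewards
        next_actions = q_net(next_obs).argmax(dim=1, keepdim=True)
        next_q = target_q_net(next_obs).gather(1, next_actions).squeeze(1)
        td_target = rewards + gamma * next_q
    j_td = F.smooth_l1_loss(q_taken, td_target)

    # Large-margin supervised loss, applied only to demonstration transitions:
    # max_a [Q(s, a) + margin * 1{a != a_E}] - Q(s, a_E).
    margins = torch.full_like(q_values, margin)
    margins.scatter_(1, demo_actions.unsqueeze(1), 0.0)
    per_sample = ((q_values + margins).max(dim=1).values
                  - q_values.gather(1, demo_actions.unsqueeze(1)).squeeze(1))
    j_e = (per_sample * is_demo.float()).mean()

    # L2 regularization on the Q-network weights.
    j_l2 = sum((p ** 2).sum() for p in q_net.parameters())

    return j_td + lambda_2 * j_e + lambda_3 * j_l2
```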
When training the reward function, the model predicts the probability of preferring a segment $\sigma_1$ over another segment $\sigma_2$ as a softmax over the summed predicted rewards, $$ \hat{P}[\sigma_1 \succ \sigma_2] = \frac{\exp \sum_t \hat{r}(o^1_t, a^1_t)}{\exp \sum_t \hat{r}(o^1_t, a^1_t) + \exp \sum_t \hat{r}(o^2_t, a^2_t)} $$ and is trained by minimizing the cross-entropy loss between this prediction and the judgement labels provided by the human experts.
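A minimal sketch of this preference loss, assuming a `reward_model` that maps a batch of segments to per-step rewards (the function name and tensor layout are illustrative assumptions, not the paper’s code):

```python
import torch.nn.functional as F


def preference_loss(reward_model, segments_1, segments_2, labels):
    """Cross-entropy on pairwise preferences.

    reward_model: maps a batch of segments (B, T, ...) to per-step rewards (B, T)
    labels: probability (0.0, 0.5 or 1.0) that the human preferred segments_1
    """
    # Sum the predicted rewards over each segment.
    r1 = reward_model(segments_1).sum(dim=1)   # (B,)
    r2 = reward_model(segments_2).sum(dim=1)   # (B,)

    # P(segment_1 preferred) is a softmax over the summed rewards, i.e.
    # sigmoid(r1 - r2), so the cross-entropy can be taken on the logit r1 - r2.
    return F.binary_cross_entropy_with_logits(r1 - r2, labels)
```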
The authors claim that this method outperforms imitation learning, in which the agent is trained only on the demonstrations from human experts.