
  1. Title: Online Decision Transformer
  2. Author: Qinqing Zheng
  3. Publish Year: Feb 2022
  4. Review Date: Mar 2022

Summary of paper

Motivation

The authors propose Online Decision Transformer (ODT), an RL algorithm based on sequence modelling that blends offline pretraining with online fine-tuning in a unified framework.

ODT builds on the Decision Transformer (DT) architecture previously introduced for offline RL.

quantify exploration

Compared to DT, they shift from deterministic to stochastic policies so that an exploration objective can be defined during the online phase. Exploration is quantified via the entropy of the policy, similar to max-ent RL frameworks.

behaviour cloning term

Adding a behaviour cloning term to offline RL methods allows off-policy RL algorithms to be ported to the offline setting with minimal changes.

offline learning and online fine-tuning

image-20220322160850770

The policy is extracted via a behaviour cloning step that avoids out-of-distribution actions.

some improvements on the offline-online settings

  1. Lee et al. (2021) tackle the offline-online setting with a balanced replay scheme and an ensemble of Q-functions to maintain conservatism during offline training.
  2. Lu et al. (2021) improve upon AWAC (Nair et al., 2020), which exhibits collapse during the online fine-tuning stage, by incorporating positive sampling and exploration during the online stage.
    • the authors claim that positive sampling and exploration are naturally embedded in the ODT method

why offline trajectories have limitations

Offline trajectories might not have high returns and may cover only a limited part of the state space.

modifications from decision transformer

  1. learn a stochastic policy (a multivariate Gaussian distribution with a diagonal covariance matrix models the action distribution conditioned on states and RTGs)
  2. quantify exploration via the policy entropy
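As a rough illustration of the two modifications above, here is a minimal sketch of a diagonal Gaussian action distribution, with its log-likelihood and closed-form entropy. The function names and the `log_std` parameterisation are assumptions made for this note, not taken from the paper, and the paper's own entropy estimate may be computed differently (for example from samples).

```python
import math

def diag_gaussian_log_prob(action, mean, log_std):
    """Log-density of `action` under a Gaussian with diagonal covariance.

    `mean` and `log_std` are per-dimension lists (hypothetical policy outputs).
    """
    total = 0.0
    for a, m, ls in zip(action, mean, log_std):
        var = math.exp(2.0 * ls)
        # Per-dimension term: -0.5 * [(a - m)^2 / var + log(var) + log(2*pi)]
        total += -0.5 * ((a - m) ** 2 / var + 2.0 * ls + math.log(2.0 * math.pi))
    return total

def diag_gaussian_entropy(log_std):
    """Differential entropy: sum over dimensions of 0.5*log(2*pi*e) + log_std."""
    return sum(0.5 * math.log(2.0 * math.pi * math.e) + ls for ls in log_std)

# A 2-dimensional action evaluated at the mean, with unit standard deviation.
log_p = diag_gaussian_log_prob([0.0, 0.0], [0.0, 0.0], [0.0, 0.0])
entropy = diag_gaussian_entropy([0.0, 0.0])
```

A larger `log_std` raises the entropy, which is the sense in which policy entropy can serve as a measure of exploration.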

Algorithm

image-20220322185822226

Some key terms

offline RL

In the sequence-modelling formulation used here, an agent is trained to autoregressively maximize the likelihood of trajectories in the offline dataset.

Policies learned via offline RL are limited by the quality of the training dataset and need to be fine-tuned to the task of interest via online interactions.

transformer for RL

it focuses on predictive modelling of action sequences conditioned on a task specification (target goal or returns) as opposed to explicitly learning Q-functions or policy gradients.

off-policy vs on-policy vs offline reinforcement learning

The process of reinforcement learning involves iteratively collecting data by interacting with the environment. This data is also referred to as experience.

These methods differ fundamentally in how this data (the collection of experiences) is generated.

On-policy RL

  • typically, experience is collected with the latest learned policy and then used to improve that policy.
  • the policy pi_k is updated with data collected by pi_k itself
  • example: SARSA, PPO, TRPO

Off-policy RL

  • in the classical off-policy setting, the agent’s experience is appended to a data buffer (also called a replay buffer)
  • each new policy pi_k collects additional data, so the replay buffer is composed of samples from pi_0, pi_1, … pi_k, and all of this data is used to train an updated policy pi_{k+1}.
  • image-20220322145244966
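The replay-buffer loop described above can be sketched as follows. This is a minimal illustration only: the `ReplayBuffer` class, its capacity, and the dictionary-shaped transitions are hypothetical, and no actual policy update is performed.

```python
from collections import deque
import random

class ReplayBuffer:
    """Fixed-capacity buffer; the oldest transitions are dropped when full."""

    def __init__(self, capacity):
        self.storage = deque(maxlen=capacity)

    def add(self, transition):
        self.storage.append(transition)

    def sample(self, batch_size):
        # Uniform sampling without replacement from everything stored so far.
        k = min(batch_size, len(self.storage))
        return random.sample(list(self.storage), k)

    def __len__(self):
        return len(self.storage)

buffer = ReplayBuffer(capacity=1000)
for k in range(3):            # stand-ins for policies pi_0, pi_1, pi_2
    for step in range(4):     # each policy adds a few transitions
        buffer.add({"policy": k, "step": step})

batch = buffer.sample(5)      # a batch may mix data from several policies
policies_seen = {t["policy"] for t in buffer.storage}
```

The point of the sketch is that the training batch is drawn from data produced by all earlier policies, not only the current one.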

Offline RL

  • offline RL methods use previously collected data, without any additional online data collection.

bootstrap method

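In statistics, the bootstrap method estimates quantities about a population by resampling a dataset with replacement and averaging the estimates computed on each resample. In RL, "bootstrapping" has a different meaning: a value estimate is updated towards a target that itself depends on other value estimates (as in the Bellman backup), which is the sense used in the next section.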
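For the statistical sense of the term, a minimal sketch of bootstrap resampling is shown below. The data values, the number of resamples, and the helper name `bootstrap_means` are made up for illustration.

```python
import random
import statistics

def bootstrap_means(data, n_resamples, seed=0):
    """Return the mean of each resample drawn with replacement from `data`."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        resample = rng.choices(data, k=len(data))  # sampling with replacement
        means.append(statistics.mean(resample))
    return means

data = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0]
estimates = bootstrap_means(data, n_resamples=200)
bootstrap_estimate = statistics.mean(estimates)  # averaged estimate of the mean
spread = statistics.stdev(estimates)             # rough uncertainty of that estimate
```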

off-policy bootstrapping error accumulation

https://arxiv.org/pdf/1906.00949.pdf

Off-policy reinforcement learning aims to leverage experience collected from prior policies for sample-efficient learning. However, in practice, commonly used off-policy approximate dynamic programming methods based on Q-learning and actor-critic methods are highly sensitive to the data distribution (out-of-distribution actions) and can make only limited progress without collecting additional on-policy data. As a step towards more robust off-policy algorithms, the authors of the linked paper study the setting where the off-policy experience is fixed and there is no further interaction with the environment. They identify bootstrapping error as a key source of instability in current methods. Bootstrapping error arises from bootstrapping from actions that lie outside the training data distribution, and it accumulates via the Bellman backup operator.
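A toy example may help show where the error enters. Assume a single next state with two actions, where `a1` never appears in the dataset, so its Q-value is an arbitrary extrapolation. All numbers and names here are invented, and restricting the max to in-dataset actions is only a simplified stand-in for the constraint-based fixes discussed in that line of work, not the method of the linked paper.

```python
GAMMA = 0.9

# Current Q estimates at the next state. "a1" is out of distribution:
# no data ever corrected its estimate, so 5.0 is an unchecked guess.
q_next = {"a0": 1.0, "a1": 5.0}
actions_in_data = {"a0"}
reward = 0.0

def bellman_target(reward, q_values, allowed_actions):
    """One-step target: r + gamma * max over the allowed actions."""
    return reward + GAMMA * max(q_values[a] for a in allowed_actions)

# Standard Q-learning backup: the max picks the unverified estimate for a1.
naive_target = bellman_target(reward, q_next, q_next.keys())

# Backup restricted to actions that actually occur in the dataset.
constrained_target = bellman_target(reward, q_next, actions_in_data)
```

Here the naive target is 4.5 while the constrained one is 0.9. Repeating the backup propagates the inflated value to earlier states, which is the accumulation the notes refer to.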

return to go (RTG)

The return-to-go of a trajectory $\tau$ at timestep $t$,

$$g_t = \sum_{t'=t}^{|\tau|} r_{t'}$$

is the sum of future rewards from that timestep.
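As a small worked example, the return-to-go at every timestep can be computed in one backward pass over the rewards. The function name and the reward values are illustrative only.

```python
def returns_to_go(rewards):
    """Return a list whose entry t is the sum of rewards from timestep t onward."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each entry reuses the sum already accumulated after it.
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        rtg[t] = running
    return rtg

rtg = returns_to_go([1.0, 0.0, 2.0, 3.0])
# rtg == [6.0, 5.0, 5.0, 3.0]
```

The first entry equals the total return of the trajectory, and the last entry equals the final reward.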