[TOC]

  1. Title: Offline Reinforcement Learning with Implicit Q-learning
  2. Author: Ilya Kostrikov et al.
  3. Publish Year: 2021
  4. Review Date: Mar 2022

Summary of paper

Motivation

conflict in offline reinforcement learning

offline reinforcement learning requires reconciling two conflicting aims:

  1. learning a policy that improves over the behaviour policy that collected the dataset,
  2. while at the same time minimizing the deviation from the behaviour policy, so as to avoid errors due to distributional shift (e.g., bootstrapping from out-of-distribution actions) -> the challenge is how to keep the actions used for bootstrapping in-distribution (IQL's answer: never evaluate the Q-function on unseen actions, so the unseen-action problem never arises).

All the previous solutions, such as (1) limiting how far the new policy deviates from the behaviour policy and (2) assigning low values to out-of-distribution actions, impose a trade-off between how much the policy improves and how vulnerable it is to misestimation due to distributional shift.

So, what we want is to never query or estimate values for actions that were not seen in the data.

limitation of “single-step” approach:

“single-step” means these methods either use no value function at all, or learn only the value function of the behaviour policy.

and these methods perform very poorly on more complex datasets that require combining parts of suboptimal trajectories.

expectile regression

their aim is not to estimate the distribution of values that results from stochastic transitions, but rather to estimate expectiles of the state value function with respect to random actions.

so the aim is not to determine how the Q-value can vary with different future outcomes, but how it can vary with different actions, while averaging together future outcomes due to stochastic dynamics.

what is expectile regression

while quantile regression can be seen as a generalisation of median regression, expectiles are the analogous generalisation of mean regression.

The $\tau \in (0, 1)$ expectile of a random variable $X$ is defined as the solution to an asymmetric least-squares problem:

$$\arg\min_{m_\tau} \mathbb{E}_{x \sim X}\big[L_2^\tau(x - m_\tau)\big], \qquad L_2^\tau(u) = |\tau - \mathbb{1}(u < 0)|\, u^2$$

For $\tau = 0.5$ this recovers the mean; as $\tau \to 1$ it approaches the maximum of $X$ over its support.

in other words, expectile regression is a generalised version of mean squared error (MSE) regression. For a conditional distribution it amounts to

$$\arg\min_{m_\tau} \mathbb{E}_{(x, y) \sim \mathcal{D}}\big[L_2^\tau(y - m_\tau(x))\big]$$

with $\tau = 0.5$ giving back standard mean regression.
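A minimal NumPy sketch of this loss and of estimating an expectile by gradient descent (function names are mine, not from the paper):

```python
import numpy as np

def expectile_loss(u, tau):
    """Asymmetric squared loss L_2^tau(u) = |tau - 1(u < 0)| * u^2."""
    weight = np.where(u < 0, 1.0 - tau, tau)
    return weight * u ** 2

def expectile(x, tau, lr=0.5, steps=2000):
    """Estimate the tau-expectile of samples x by gradient descent on the loss above."""
    m = float(np.mean(x))
    for _ in range(steps):
        u = x - m
        # d/dm of mean(|tau - 1(u<0)| * u^2) is -2 * mean(weight * u)
        grad = -2.0 * np.mean(np.where(u < 0, 1.0 - tau, tau) * u)
        m -= lr * grad
    return m

samples = np.random.standard_normal(10_000)
print(expectile(samples, 0.5))  # roughly the sample mean (close to 0)
print(expectile(samples, 0.9))  # larger: weight is shifted onto the upper tail
```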

why out of distribution action is bad for offline RL

an out-of-distribution action $a'$ can produce erroneous values for $Q_\theta(s', a')$ in the temporal difference (TD) error objective, often leading to overestimation, since the policy is defined to maximise the (estimated) Q-value.
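For reference, the TD objective in question has (roughly) the following standard form, where the max queries $Q_{\hat\theta}$ at actions $a'$ that may never appear in the dataset:

$$L_{TD}(\theta) = \mathbb{E}_{(s, a, s') \sim \mathcal{D}}\Big[\big(r(s, a) + \gamma \max_{a'} Q_{\hat\theta}(s', a') - Q_\theta(s, a)\big)^2\Big]$$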

SARSA-style Q-function objective

$$L(\theta) = \mathbb{E}_{(s, a, s') \sim \mathcal{D}}\Big[\big(r(s, a) + \gamma\, \mathbb{E}_{a' \sim \pi_\beta(\cdot \mid s')}\big[Q_{\hat\theta}(s', a')\big] - Q_\theta(s, a)\big)^2\Big]$$

$Q_{\hat\theta}$ is the target network (a slowly updated copy of $Q_\theta$, not updated by gradient descent)

$\pi_\beta$ is the behaviour policy

and this avoids issues with out-of-distribution actions, because $\pi_\beta$ only produces actions that appear in the dataset, so the bootstrap target never queries unseen actions.

How to improve: only consider in-distribution actions when computing the Q-value targets

$$L(\theta) = \mathbb{E}_{(s, a, s') \sim \mathcal{D}}\Big[\big(r(s, a) + \gamma \max_{a' : \pi_\beta(a' \mid s') > 0} Q_{\hat\theta}(s', a') - Q_\theta(s, a)\big)^2\Big]$$

Moreover, this constrained maximum is approximated implicitly by applying the expectile regression objective, with a separate state-value network $V_\psi$ estimating an upper expectile of $Q_{\hat\theta}(s, a)$ over dataset actions:

$$L_V(\psi) = \mathbb{E}_{(s, a) \sim \mathcal{D}}\big[L_2^\tau\big(Q_{\hat\theta}(s, a) - V_\psi(s)\big)\big]$$

$$L_Q(\theta) = \mathbb{E}_{(s, a, s') \sim \mathcal{D}}\big[\big(r(s, a) + \gamma\, V_\psi(s') - Q_\theta(s, a)\big)^2\big]$$
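A hedged PyTorch-style sketch of these two losses, assuming batched tensors from a dataset transition batch; helper names and default hyperparameters are illustrative, not the paper's reference implementation:

```python
import torch
import torch.nn.functional as F

def expectile_loss(diff: torch.Tensor, tau: float) -> torch.Tensor:
    """L_2^tau applied elementwise to diff = Q_target(s, a) - V(s)."""
    weight = torch.abs(tau - (diff < 0).float())
    return (weight * diff ** 2).mean()

def value_loss(q_target: torch.Tensor, v: torch.Tensor, tau: float = 0.7) -> torch.Tensor:
    """L_V(psi): expectile regression of V_psi(s) toward Q_hat(s, a) on dataset actions."""
    return expectile_loss(q_target.detach() - v, tau)

def q_loss(q: torch.Tensor, reward: torch.Tensor, v_next: torch.Tensor,
           done: torch.Tensor, gamma: float = 0.99) -> torch.Tensor:
    """L_Q(theta): plain TD regression toward r + gamma * V_psi(s')."""
    target = reward + gamma * (1.0 - done) * v_next.detach()
    return F.mse_loss(q, target)
```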

require additional policy extraction step

while this modified TD learning procedure learns an approximation to the optimal Q-function, it does not explicitly represent the corresponding policy, and therefore requires a separate policy extraction step

as before, we aim to avoid using out-of-sample actions; therefore, the policy is extracted with advantage-weighted regression (AWR):

$$L_\pi(\phi) = \mathbb{E}_{(s, a) \sim \mathcal{D}}\big[\exp\big(\beta\,(Q_{\hat\theta}(s, a) - V_\psi(s))\big)\log \pi_\phi(a \mid s)\big]$$

which is maximised with respect to $\phi$; $\beta \in [0, \infty)$ is an inverse temperature (small $\beta$ behaves like behaviour cloning, large $\beta$ tries to recover the maximum of the Q-function).
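A matching sketch of the extraction step (again PyTorch-style, names mine; clipping the exponentiated advantage is a common stabilisation trick rather than part of the equation above):

```python
import torch

def awr_policy_loss(policy, states, actions, q_target, v, beta=3.0, max_weight=100.0):
    """Weighted behaviour cloning on dataset actions.

    Assumes policy(states) returns a torch.distributions.Distribution whose
    log_prob(actions) yields one log-likelihood per batch element.
    """
    advantage = (q_target - v).detach()                          # no gradient into Q or V
    weight = torch.exp(beta * advantage).clamp(max=max_weight)   # clip for numerical stability
    log_prob = policy(states).log_prob(actions)
    return -(weight * log_prob).mean()                           # minimise the negated objective
```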

final algorithm

(Algorithm 1 in the paper: initialise $\psi$, $\theta$, $\hat\theta \leftarrow \theta$, $\phi$; for each TD-learning step, take a gradient step on $L_V(\psi)$ and on $L_Q(\theta)$ and update the target network $\hat\theta \leftarrow (1 - \alpha)\hat\theta + \alpha\theta$; extract the policy with gradient steps on the AWR objective $L_\pi(\phi)$.)

Note that the policy does not influence the value function in any way, so extraction can be performed either concurrently with TD learning or after it. Concurrent learning provides a way to use IQL with online fine-tuning.
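Putting the pieces together, one training step might look like the following sketch, reusing the loss helpers above; networks, optimisers, and the data batch are assumed to exist and are named only for illustration:

```python
import torch

def iql_update(batch, q_net, q_target_net, v_net, policy, optims,
               tau=0.7, beta=3.0, gamma=0.99, alpha=0.005):
    """One combined gradient step (sketch; reuses value_loss, q_loss, awr_policy_loss above)."""
    s, a, r, s_next, done = batch

    # 1. Value function: expectile regression toward the frozen target Q
    with torch.no_grad():
        q_t = q_target_net(s, a)
    loss_v = value_loss(q_t, v_net(s), tau)
    optims["v"].zero_grad(); loss_v.backward(); optims["v"].step()

    # 2. Q function: TD regression toward r + gamma * V(s')
    with torch.no_grad():
        v_next = v_net(s_next)
    loss_q = q_loss(q_net(s, a), r, v_next, done, gamma)
    optims["q"].zero_grad(); loss_q.backward(); optims["q"].step()

    # 3. Polyak averaging of the target Q network
    with torch.no_grad():
        for p, p_t in zip(q_net.parameters(), q_target_net.parameters()):
            p_t.mul_(1 - alpha).add_(alpha * p)

    # 4. Policy extraction; nothing here feeds back into V or Q,
    #    so this step could equally run after TD learning has finished
    with torch.no_grad():
        v_s = v_net(s)
    loss_pi = awr_policy_loss(policy, s, a, q_t, v_s, beta)
    optims["pi"].zero_grad(); loss_pi.backward(); optims["pi"].step()
```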

Some key terms

distributional shift

distributional shift refers to the situation where the training data distribution is not the same as the data distribution encountered at test time.

Q function

Q(s, a) is the expected return, assuming the agent is in state s, performs action a, and then continues acting until the end of the episode following some policy $\pi$.
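In symbols, for a discounted episodic setting:

$$Q^\pi(s, a) = \mathbb{E}_\pi\Big[\sum_{t \ge 0} \gamma^t\, r(s_t, a_t) \,\Big|\, s_0 = s,\ a_0 = a\Big]$$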

SARSA vs Q-learning

SARSA is more conservative than Q-learning: it bootstraps from the action the policy actually took next, rather than from the greedy (max-value) action.

SARSA update: $Q(s, a) \leftarrow Q(s, a) + \alpha\,[r + \gamma\, Q(s', a') - Q(s, a)]$

Q-learning update: $Q(s, a) \leftarrow Q(s, a) + \alpha\,[r + \gamma \max_{a'} Q(s', a') - Q(s, a)]$
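A toy tabular example of the difference (values are made up, purely illustrative):

```python
import numpy as np

# Toy Q-table (2 states x 2 actions), just to make the two targets concrete.
Q = np.array([[0.0, 1.0],
              [2.0, 0.5]])
gamma = 0.9
r, s_next, a_next = 1.0, 1, 1        # a_next = action the policy actually took next

sarsa_target = r + gamma * Q[s_next, a_next]     # 1 + 0.9 * 0.5 = 1.45 (on-policy)
q_learning_target = r + gamma * Q[s_next].max()  # 1 + 0.9 * 2.0 = 2.80 (greedy)
```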

Q learning vs Dynamic programming

Dynamic programming computes optimal policies from an already given model of the environment. Q-learning, in contrast, creates policies solely from the rewards it receives by interacting with the environment, without requiring a model.
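For contrast, a dynamic-programming backup (value iteration) needs the transition model $p(s' \mid s, a)$, whereas the sample-based Q-learning update above needs only observed transitions:

$$V(s) \leftarrow \max_{a} \sum_{s'} p(s' \mid s, a)\,\big[r(s, a, s') + \gamma\, V(s')\big]$$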

Contribution

  1. the algorithm is computationally efficient: can perform 1M updates on one GTX1080 GPU in less than 20 minutes.
  2. simple to implement, requiring only minor modifications over a standard SARSA-like TD algorithm, and performing policy extraction with a simple weighted behaviour cloning procedure resembling supervised learning.