[TOC]

  1. Title: On the Theory of Policy Gradient Methods: Optimality, Approximation, and Distribution Shift (2020)
  2. Author: Alekh Agarwal et al.
  3. Publish Year: 14 Oct 2020
  4. Review Date: Wed, Dec 28, 2022

Summary of paper

Motivation


Contribution

Some key terms

basic theoretical convergence questions

  1. if and how fast they converge to a globally optimal solution (say with a sufficiently rich policy class)
  2. how they cope with approximation error due to using a restricted class of parametric policies
  3. what their finite-sample behaviour is.

tabular policy parameterisation

  1. there is one parameter per state-action pair so the policy class is complete in that it contains the optimal policy
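
As a minimal sketch of "one parameter per state-action pair", the snippet below uses a softmax over a table of parameters. The softmax form, the function name `softmax_policy`, and the toy sizes (2 states, 3 actions) are assumptions chosen for illustration. Note that with softmax a deterministic optimal policy is only approached in the limit of large parameters.

```python
import math

def softmax_policy(theta, state):
    """Return action probabilities pi_theta(. | state) for a tabular softmax policy.

    theta[state][action] holds one free parameter per state-action pair.
    """
    logits = theta[state]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

# Hypothetical toy problem: 2 states, 3 actions, all parameters start at zero,
# which gives the uniform policy in every state.
theta = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
uniform = softmax_policy(theta, 0)

# Raising a single parameter concentrates probability on that action, so any
# deterministic policy can be approached arbitrarily closely.
theta[1][2] = 20.0
near_greedy = softmax_policy(theta, 1)
```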

function approximation

  1. we have a restricted class of parametric policies which may not contain the globally optimal policy.

convergence rates (IMPORTANT)

excess risk

Approximation error

non-stationary policy

Concentrability coefficient

  1. Concentrability ensures that the ratio between the induced state-action distribution of any non-stationary policy and the state-action distribution in the batch data is upper bounded by a constant, called the concentrability coefficient.
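
The ratio in this definition can be sketched for a single, fixed pair of distributions. The helper name `concentrability_ratio` and the toy distributions are assumptions for illustration. The actual coefficient is the smallest constant that bounds this ratio for every non-stationary policy, which the sketch does not attempt to compute.

```python
def concentrability_ratio(induced, data):
    """Largest ratio induced[(s, a)] / data[(s, a)] over state-action pairs.

    Both arguments are dicts mapping (state, action) to a probability.
    Returns infinity if the policy visits a pair the batch data never covers.
    """
    worst = 0.0
    for sa, p in induced.items():
        if p == 0.0:
            continue  # pairs the policy never visits impose no constraint
        q = data.get(sa, 0.0)
        if q == 0.0:
            return float("inf")
        worst = max(worst, p / q)
    return worst

# Hypothetical batch data that covers four state-action pairs uniformly.
data = {("s0", "a0"): 0.25, ("s0", "a1"): 0.25,
        ("s1", "a0"): 0.25, ("s1", "a1"): 0.25}

# A policy that concentrates on two of those pairs.
induced = {("s0", "a0"): 0.5, ("s1", "a1"): 0.5}
covered = concentrability_ratio(induced, data)

# A policy that reaches a pair missing from the data breaks the bound.
uncovered = concentrability_ratio({("s2", "a0"): 1.0}, data)
```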

overview of approximate methods

The performance difference lemma
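
For reference, the lemma in its standard form is given below. The notation is an assumption: $A^{\pi'}$ denotes the advantage function of $\pi'$, $\gamma$ the discount factor, and $d_{s_0}^\pi$ the discounted state visitation distribution of $\pi$ started from $s_0$. For all policies $\pi, \pi'$ and all states $s_0$:

$$
V^\pi(s_0) - V^{\pi'}(s_0) = \frac{1}{1-\gamma}\, \mathbb{E}_{s \sim d_{s_0}^\pi}\, \mathbb{E}_{a \sim \pi(\cdot \mid s)} \left[ A^{\pi'}(s, a) \right]
$$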

The distribution mismatch coefficient

  1. The difficulty of the exploration problem faced by policy optimisation algorithms when maximising the objective $V^\pi(\mu)$ is often characterised through the following notion of distribution mismatch coefficient.
  2. The larger the coefficient, the worse the exploration: the hardness of the exploration problem is captured through the distribution mismatch coefficient.
  3. discounted state visitation distribution $d_{s_0}^\pi(s) = (1-\gamma)\sum_{t=0}^\infty \gamma^t \Pr^\pi(s_t = s \mid s_0)$
  4. Given a policy $\pi$ and measures $\rho, \mu \in \Delta(S)$, we refer to $||\frac{d_\rho^\pi}{\mu}||_\infty$ as the distribution mismatch coefficient of $\pi$ relative to $\mu$, where the division is componentwise and $\rho$ is the distribution of starting states.
    1. $\mu$ is the state distribution used for fitting (optimisation), i.e. the (initial) state distribution in the objective $V^\pi(\mu)$.
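
A rough numerical sketch of these quantities on a toy chain is given below. The function names, the two-state transition matrix `P_pi`, the truncation `horizon`, and the choices of $\rho$ and $\mu$ are all assumptions made for illustration. The code takes $d_\rho^\pi$ to be $d_{s_0}^\pi$ averaged over $s_0 \sim \rho$, and approximates the infinite discounted sum by truncating it.

```python
def discounted_visitation(P_pi, rho, gamma, horizon=2000):
    """Approximate d_rho^pi(s) = (1 - gamma) * sum_t gamma^t Pr(s_t = s | s_0 ~ rho).

    P_pi[s][s2] is the state-to-state transition probability under a fixed policy.
    The infinite sum is truncated after `horizon` steps.
    """
    n = len(rho)
    dist = list(rho)       # distribution of s_t, starting at t = 0
    d = [0.0] * n
    weight = 1.0 - gamma   # equals (1 - gamma) * gamma^t at step t
    for _ in range(horizon):
        for s in range(n):
            d[s] += weight * dist[s]
        # advance the state distribution by one step of the chain
        dist = [sum(dist[s] * P_pi[s][s2] for s in range(n)) for s2 in range(n)]
        weight *= gamma
    return d

def mismatch_coefficient(d, mu):
    """Sup-norm of the componentwise ratio d / mu."""
    worst = 0.0
    for d_s, mu_s in zip(d, mu):
        if d_s == 0.0:
            continue
        if mu_s == 0.0:
            return float("inf")  # the policy visits a state that mu ignores
        worst = max(worst, d_s / mu_s)
    return worst

# Hypothetical chain: state 0 always moves to state 1, and state 1 is absorbing.
P_pi = [[0.0, 1.0], [0.0, 1.0]]
rho = [1.0, 0.0]  # always start in state 0
d = discounted_visitation(P_pi, rho, gamma=0.9)

coef_uniform = mismatch_coefficient(d, [0.5, 0.5])    # mu spreads mass evenly
coef_skewed = mismatch_coefficient(d, [0.99, 0.01])   # mu barely covers state 1
```

In this toy chain the visitation mass sits mostly on state 1, so a $\mu$ that puts little weight there yields a much larger coefficient, which matches the reading that a larger coefficient means worse exploration.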

Potential future work