[TOC]

  1. Title: PC-PG: Policy Cover Directed Exploration for Provable Policy Gradient Learning
  2. Author: Alekh Agarwal et al.
  3. Publish Year:
  4. Review Date: Wed, Dec 28, 2022

Summary of paper

Motivation


  • The primary drawback of direct policy gradient methods is that, by being local in nature, they fail to adequately explore the environment.
  • In contrast, model-based approaches and Q-learning handle exploration directly through the use of optimism.

Contribution

  • Policy Cover-Policy Gradient (PC-PG) algorithm: a direct, model-free policy optimisation approach which addresses exploration through a learned ensemble of policies, which together provide a policy cover over the state space (sketched just below).
    • the use of a learned policy cover addresses exploration, and also addresses the catastrophic forgetting problem that arises in policy gradient approaches which rely on reward bonuses;
    • this contrasts with on-policy algorithms, where approximation errors due to model misspecification can amplify (see [Lu et al., 2018] for discussion).
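Concretely (a sketch in my own notation, not the paper's exact display): after episode $n$ the cover is the set of learned policies $\{\pi^0, \dots, \pi^n\}$, and the induced coverage distribution is the mixture of their state-action visitation distributions,

$$\rho^{\,n}_{\mathrm{cov}}(s, a) \;=\; \frac{1}{n+1} \sum_{i=0}^{n} d^{\pi^i}(s, a).$$

Training the next policy against this wider distribution is what simultaneously drives exploration and prevents forgetting of regions covered by earlier policies.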

Some key terms

suffering from sparse reward

  • The assumptions in these works imply that the state space is already well-explored. Conversely, without such coverage (and, say, with sparse rewards), policy gradients often suffer from the vanishing gradient problem.
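The mechanism is easy to see from the standard REINFORCE gradient (generic notation, not specific to this paper):

$$\nabla_\theta J(\theta) \;=\; \mathbb{E}_{\tau \sim \pi_\theta}\Big[\, R(\tau) \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \Big].$$

If rewards are sparse and $\pi_\theta$ essentially never reaches a rewarding state, then $R(\tau) \approx 0$ on every sampled trajectory, so the estimated gradient is near zero and the policy stops improving even though it is far from optimal.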

original objective function and coverage of state space

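In common policy-gradient notation (a stand-in sketch, not the paper's exact display), the original objective is to maximise value from a fixed start distribution $\mu$:

$$\max_{\pi}\; \mathbb{E}_{s_0 \sim \mu}\big[ V^{\pi}(s_0) \big].$$

When $\mu$ concentrates on a few start states, an on-policy gradient method only sees the states its current policy happens to visit, which is exactly the coverage problem above.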

wider coverage objective

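Again a reconstruction in generic notation rather than the paper's exact display: the idea is to optimise the same value, but under a distribution with wide state coverage, which in PC-PG is supplied by the policy cover $\rho^{\,n}_{\mathrm{cov}}$,

$$\max_{\pi}\; \mathbb{E}_{s \sim \rho^{\,n}_{\mathrm{cov}}}\big[ V^{\pi}(s) \big],$$

so that the distribution-mismatch term appearing in policy-gradient analyses stays bounded even when the original $\mu$ has poor coverage.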

iterative algorithm PC-PG

  • the idea is to successively improve both the current policy $\pi$ and the coverage distribution
  • the algorithm starts with some policy $\pi^0$, and works in episodes.
  • a bonus $b^n$ is used to encourage the algorithm to find a policy $\pi^{n+1}$ that covers a novel part of the state-action space (see the sketch after this list)
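To make the loop concrete, here is a minimal tabular sketch on a toy sparse-reward chain MDP. Everything below is my own simplification: the count-based bonus stands in for the paper's feature-covariance bonus, plain REINFORCE stands in for the NPG subroutine, and names like `cover_counts` and `pg_update` are placeholders rather than anything from the paper's code.

```python
# Toy, tabular sketch of the PC-PG outer loop (simplified; not the paper's exact algorithm).
import numpy as np

rng = np.random.default_rng(0)
N_STATES, N_ACTIONS, HORIZON = 8, 2, 20   # sparse-reward chain: reward only at the right end


def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()


def step(s, a):
    """Deterministic chain dynamics: action 1 moves right, action 0 moves left."""
    s_next = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    return s_next, 1.0 if s_next == N_STATES - 1 else 0.0


def rollout(theta, bonus, s0=0):
    """One episode of the softmax policy theta; the return uses reward + bonus."""
    s, traj, ret = s0, [], 0.0
    for _ in range(HORIZON):
        a = rng.choice(N_ACTIONS, p=softmax(theta[s]))
        s_next, r = step(s, a)
        traj.append((s, a))
        ret += r + bonus[s, a]
        s = s_next
    return traj, ret


def cover_counts(cover, n_rollouts=20):
    """Empirical state-action visitation of the policy cover (uniform mixture)."""
    counts = np.zeros((N_STATES, N_ACTIONS))
    zero_bonus = np.zeros((N_STATES, N_ACTIONS))
    for theta in cover:
        for _ in range(n_rollouts):
            for s, a in rollout(theta, zero_bonus)[0]:
                counts[s, a] += 1
    return counts


def pg_update(theta, bonus, start_dist, iters=200, lr=0.1, batch=10):
    """Plain REINFORCE on the bonus-augmented reward, restarting from the cover
    (the paper uses an NPG subroutine instead)."""
    for _ in range(iters):
        grad = np.zeros_like(theta)
        for _ in range(batch):
            s0 = rng.choice(N_STATES, p=start_dist)
            traj, ret = rollout(theta, bonus, s0)
            for s, a in traj:
                grad[s] += (np.eye(N_ACTIONS)[a] - softmax(theta[s])) * ret
        theta = theta + lr * grad / batch
    return theta


# PC-PG outer loop: grow the policy cover and recompute the bonus every epoch.
cover = [np.zeros((N_STATES, N_ACTIONS))]           # pi^0: uniform random policy
for epoch in range(5):
    counts = cover_counts(cover)                    # coverage of the current cover
    bonus = np.where(counts < 5, 1.0, 0.0)          # b^n: reward novelty w.r.t. the cover
    start_dist = counts.sum(axis=1) / counts.sum()  # restart from the cover's visitation
    new_theta = pg_update(np.zeros((N_STATES, N_ACTIONS)), bonus, start_dist)
    cover.append(new_theta)                         # pi^{n+1} joins the cover

print("greedy action per state:", [int(np.argmax(row)) for row in cover[-1]])
```

The two ingredients this sketch tries to preserve are (i) the bonus is defined relative to the whole cover rather than just the latest policy, and (ii) the new policy is trained with restarts drawn from the cover's visitation distribution, which is what gives the wider coverage objective above.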

Potential future work

  • So exploration here essentially means ensuring good state coverage along the training trajectories, so that convergence to the optimum can be guaranteed.