[TOC]

  1. Title: Proximal Policy Optimisation Explained Blog
  2. Author: Xiao-Yang Liu; DI-engine
  3. Publish Date: May 4, 2021
  4. Review Date: Mon, Dec 26, 2022

Difference between on-policy and off-policy

(Fig. 1: on-policy vs. off-policy transition collection)

  • On-policy algorithms update the policy network using only transitions generated by the current policy network. The critic network therefore makes more accurate value predictions for the current policy in common environments.
  • Off-policy algorithms allow the current policy network to be updated with transitions collected by old policies, so old transitions can be reused. As shown in Fig. 1, the sampled points are scattered over trajectories generated by different policies, which improves sample efficiency and reduces the total number of training steps.
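The reuse of old transitions rests on importance sampling: a transition collected under an old policy is reweighted by the ratio of new to old action probabilities. A toy sketch of the idea (the two-action distributions and "advantage" values here are invented purely for the demo):

```python
import numpy as np

rng = np.random.default_rng(0)

# Estimate E_{a ~ pi_new}[A(a)] using samples drawn under pi_old,
# reweighted by the importance ratio pi_new(a) / pi_old(a).
p_old = np.array([0.5, 0.5])    # behaviour (old) policy over 2 actions
p_new = np.array([0.8, 0.2])    # current (new) policy
adv = np.array([1.0, -1.0])     # made-up per-action advantage

actions = rng.choice(2, size=100_000, p=p_old)
weights = p_new[actions] / p_old[actions]
estimate = np.mean(weights * adv[actions])
# True value under pi_new: 0.8 * 1 + 0.2 * (-1) = 0.6
```

When the two policies drift far apart the ratios become extreme and the estimate degrades, which is exactly the problem PPO's surrogate objective is designed to control.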

Question: is there a way to improve the sample efficiency of on-policy algorithms without losing their benefits?

  • PPO improves sample efficiency by using a surrogate objective that keeps the new policy from moving too far from the old policy. The surrogate objective is the key feature of PPO, since it both (1) regularizes the policy update and (2) enables the reuse of training data.
  • (figures: surrogate objective equations)
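The equations in the figures are presumably the clipped surrogate objective from the PPO paper, L^CLIP = E[min(r_t · A_t, clip(r_t, 1−ε, 1+ε) · A_t)], where r_t is the new/old probability ratio. A minimal NumPy sketch (function and argument names are my own):

```python
import numpy as np

def clipped_surrogate(ratio, advantage, eps=0.2):
    """PPO clipped surrogate objective (to be maximised).

    ratio:     r_t = pi_new(a_t|s_t) / pi_old(a_t|s_t), per sample
    advantage: advantage estimate A_t, per sample
    eps:       clipping range epsilon (0.2 in the PPO paper)
    """
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    # Taking the elementwise minimum makes the objective pessimistic:
    # the policy gets no extra reward for pushing the ratio past 1 +/- eps.
    return np.minimum(unclipped, clipped).mean()
```

For a positive advantage the objective flattens once the ratio exceeds 1+ε; for a negative advantage it flattens below 1−ε. In both cases the gradient vanishes outside the trust region, which is what regularizes the update while still allowing the same batch of data to be reused.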

Algorithm

(figure: PPO algorithm pseudocode)

Explanation

(figure: explanation of the algorithm steps)

Generalized advantage estimator (GAE)

(figure: GAE definition)
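GAE, which the figure presumably defines, is A_t = Σ_l (γλ)^l δ_{t+l} with the TD error δ_t = r_t + γV(s_{t+1}) − V(s_t); it is usually computed by a backward recursion. A minimal sketch (argument names are my own):

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation via backward recursion.

    rewards: array of length T
    values:  array of length T + 1 (bootstrap value for the final state appended)
    """
    T = len(rewards)
    adv = np.zeros(T)
    last = 0.0
    for t in reversed(range(T)):
        # One-step TD error at time t
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Discounted, lambda-weighted sum of future TD errors
        last = delta + gamma * lam * last
        adv[t] = last
    return adv
```

λ interpolates between the one-step TD advantage (λ = 0, low variance, high bias) and the full Monte Carlo return minus baseline (λ = 1, high variance, low bias).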

Total PPO loss

(figure: total PPO loss)
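In the PPO paper's notation the total objective is L = L^CLIP − c₁ L^VF + c₂ S[π] (maximised), typically implemented as the minimised negative. A sketch assuming that form, with the paper's coefficient names c₁ and c₂:

```python
def total_ppo_loss(policy_obj, value_loss, entropy, c1=0.5, c2=0.01):
    """Combined PPO loss to *minimise*.

    policy_obj: clipped surrogate objective L^CLIP (to be maximised)
    value_loss: squared-error critic loss L^VF (to be minimised)
    entropy:    policy entropy bonus S (to be maximised, encourages exploration)
    """
    # -(L^CLIP - c1 * L^VF + c2 * S)
    return -policy_obj + c1 * value_loss - c2 * entropy
```

The coefficient defaults here are common choices, not universal; implementations such as DI-engine expose them as tunable hyperparameters.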