[TOC]

  1. Title: Robust Policy Gradient Against Strong Data Corruption
  2. Author: Xuezhou Zhang et al.
  3. Publish Year: 2021
  4. Review Date: Tue, Dec 27, 2022

Summary of paper

Abstract

[Figure: screenshot of the paper's abstract]

Contribution

  • the authors utilise an SVD-denoising technique to identify and remove possible reward perturbations (a rough sketch of the idea follows this list)
  • this approach yields an RL algorithm that is robust to such data corruption
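
A minimal, hypothetical sketch of the SVD-based filtering idea, not the paper's exact algorithm: stack per-episode estimates into a matrix, flag the samples that project most heavily onto the top singular direction (where concentrated corruption tends to show up), and average the rest. The function name `svd_filtered_mean` and the parameter `trim_frac` are illustrative assumptions.

```python
import numpy as np

def svd_filtered_mean(samples, trim_frac=0.1):
    """Robust mean via SVD-based filtering (illustrative sketch only,
    not the paper's exact procedure).

    samples: (n, d) array, e.g. per-episode gradient or reward estimates,
             a small fraction of which may be adversarially corrupted.
    trim_frac: assumed upper bound on the corrupted fraction.
    """
    X = np.asarray(samples, dtype=float)
    mu = X.mean(axis=0)
    # Top right-singular vector of the centered data gives the direction of
    # largest variance; concentrated corruption tends to align with it.
    _, _, vt = np.linalg.svd(X - mu, full_matrices=False)
    top_dir = vt[0]
    # Score each sample by its squared projection onto that direction
    # and drop the most extreme ones before averaging.
    scores = ((X - mu) @ top_dir) ** 2
    n_keep = int(np.ceil((1.0 - trim_frac) * len(X)))
    keep = np.argsort(scores)[:n_keep]
    return X[keep].mean(axis=0)

# Usage: 95 clean gradient samples plus 5 corrupted ones.
rng = np.random.default_rng(0)
clean = rng.normal(loc=1.0, scale=0.5, size=(95, 4))
corrupted = np.full((5, 4), -20.0)
print(svd_filtered_mean(np.vstack([clean, corrupted])))  # close to the clean mean (~1.0)
```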

Limitation

  • This approach only handles attack perturbations that are not consistent (i.e. not stealthy).

Some key terms

Policy gradient methods

  • a popular class of RL methods among practitioners, as they are amenable to parametric policy classes and resilient to mismatches in modelling assumptions

Practicality of existing work on robust RL

  • the majority of these works focus on the tabular MDP setting and cannot be applied to real-world RL problems that have large state and action spaces and require function approximation.

Policy gradient methods can be viewed as stochastic gradient ascent

  • Conceptually, policy gradient methods can be viewed as stochastic gradient ascent, where each iteration can be simplified as $\theta^{(t+1)} = \theta^{(t)} + g^{(t)}$
  • where $g^{(t)}$ is the gradient step that ideally points in the direction of fastest policy improvement. Assuming that $g^{(t)}$ is a good estimate of the gradient direction, a simple attack strategy is to perturb $g^{(t)}$ to point in the $-g^{(t)}$ direction, in which case the policy, rather than improving, will deteriorate as learning proceeds (see the sketch below).
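
A tiny runnable sketch (not from the paper) of this failure mode on a toy concave objective: the clean run ascends toward the optimum, while an attacker who flips $g^{(t)}$ into $-g^{(t)}$ drives the objective down. The objective $J$, `theta_star`, and the always-flip attack are illustrative assumptions.

```python
import numpy as np

# Toy stand-in for "policy improvement": maximise the concave objective
# J(theta) = -||theta - theta_star||^2 by gradient ascent.
theta_star = np.array([2.0, -1.0])

def J(theta):
    return -np.sum((theta - theta_star) ** 2)

def grad_J(theta):
    return -2.0 * (theta - theta_star)

def run(flip_gradient=False, steps=100, lr=0.05):
    theta = np.zeros_like(theta_star)
    for _ in range(steps):
        g = grad_J(theta)       # ideally points toward fastest improvement
        if flip_gradient:       # attacker perturbs g^{(t)} into -g^{(t)}
            g = -g
        theta = theta + lr * g  # theta^{(t+1)} = theta^{(t)} + lr * g^{(t)}
    return J(theta)

print("clean run    J:", run(flip_gradient=False))  # approaches 0 (the optimum)
print("attacked run J:", run(flip_gradient=True))   # deteriorates as learning proceeds
```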

Major comments

Citation

Why RL agents need robustness

  • In fact, data corruption can be a larger threat in the RL paradigm than in traditional supervised learning, because supervised learning is often applied in a controlled environment where data are collected and cleaned by highly skilled data scientists and domain experts, whereas RL agents are developed to learn in the wild using raw feedback from the environment.
    • ref: Zhang, Xuezhou, et al. “Robust policy gradient against strong data corruption.” International Conference on Machine Learning. PMLR, 2021.

Potential future work

we can use the explanation