[TOC]
- Title: Robust Policy Gradient Against Strong Data Corruption
- Author: Xuezhou Zhang et al.
- Publish Year: 2021
- Review Date: Tue, Dec 27, 2022
Summary of paper
Abstract
Contribution
- The authors use an SVD-denoising technique to identify and remove possible reward perturbations.
- This approach yields a robust RL algorithm.
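The note above names SVD-denoising but does not spell out the procedure, so the following is only a minimal sketch of a generic spectral filter, not the paper's algorithm. It assumes corrupted samples stand out along the top singular direction of the centred data. The function name `svd_filter_mean`, the `eps` corruption budget, the `rounds` parameter, and the synthetic data are all hypothetical choices made for illustration.

```python
import numpy as np

def svd_filter_mean(X, eps, rounds=5):
    """Mean of X after dropping points that dominate the top singular direction.

    Hypothetical helper: eps is an assumed upper bound on the corrupted fraction.
    """
    X = np.asarray(X, dtype=float)
    keep = np.ones(len(X), dtype=bool)
    # Spread the removal budget (about eps * n points) evenly over the rounds.
    n_drop = max(1, int(np.ceil(eps * len(X) / rounds)))
    for _ in range(rounds):
        centered = X[keep] - X[keep].mean(axis=0)
        # Rows of vt are right singular vectors; vt[0] is the top direction.
        _, _, vt = np.linalg.svd(centered, full_matrices=False)
        scores = (centered @ vt[0]) ** 2
        worst = np.argsort(scores)[-n_drop:]
        # Map positions within the kept subset back to original indices.
        keep[np.flatnonzero(keep)[worst]] = False
    return X[keep].mean(axis=0), keep

# Synthetic illustration: 180 clean samples around 0, 20 corrupted samples at 10.
rng = np.random.default_rng(0)
clean = rng.normal(0.0, 1.0, size=(180, 5))
corrupted = np.full((20, 5), 10.0)
X = np.vstack([clean, corrupted])

naive_mean = X.mean(axis=0)
robust_mean, keep = svd_filter_mean(X, eps=0.1)
print(np.linalg.norm(naive_mean), np.linalg.norm(robust_mean))
```

In this toy setting the corrupted points are large and all lie in one direction, which is the easy case for a spectral filter. This is consistent with the limitation noted below that stealthier perturbations are harder to catch.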
Limitation
- This approach only handles attack perturbations that are not consistent (i.e., not stealthy).
Some key terms
Policy gradient methods
- A popular class of RL methods among practitioners, as they are amenable to parametric policy classes and resilient to modelling assumption mismatches.
Practicability of the existing works on robust RL
- The majority of these works focus on the setting of tabular MDPs and cannot be applied to real-world RL problems that have large state and action spaces and require function approximation.
Policy gradient methods can be viewed as a stochastic gradient ascent method
- Conceptually, policy gradient methods can be viewed as a stochastic gradient ascent method, where each iteration can be simplified as $\theta^{(t+1)} = \theta^{(t)} + g^{(t)}$,
- where $g^{(t)}$ is the gradient step that ideally points in the direction of fastest policy improvement. Assuming that $g^{(t)}$ is a good estimate of the gradient direction, a simple attack strategy is to perturb $g^{(t)}$ so that it points in the $-g^{(t)}$ direction, in which case the policy, rather than improving, will deteriorate as learning proceeds.
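The update rule and the sign-flip attack described above can be illustrated with a small toy sketch. It assumes a simple concave objective $J(\theta) = -\lVert\theta - \theta^*\rVert^2$ in place of a real policy return, uses the exact gradient in place of a stochastic estimate, and makes the step size `lr` explicit (the note folds it into $g^{(t)}$). The names `run_gradient_ascent`, `objective`, and `flip` are hypothetical.

```python
import numpy as np

def objective(theta, target):
    """Toy stand-in for policy performance: J(theta) = -||theta - target||^2."""
    return -float(np.sum((theta - target) ** 2))

def run_gradient_ascent(theta0, target, steps=50, lr=0.1, flip=False):
    """Iterate theta <- theta + lr * g, optionally with the step reversed."""
    theta = np.asarray(theta0, dtype=float).copy()
    for _ in range(steps):
        g = -2.0 * (theta - target)  # exact gradient of J at theta
        if flip:
            g = -g                   # assumed attacker flips the step direction
        theta = theta + lr * g
    return theta

target = np.array([1.0, -2.0])
theta0 = np.zeros(2)

clean = run_gradient_ascent(theta0, target)
attacked = run_gradient_ascent(theta0, target, flip=True)

print("initial :", objective(theta0, target))
print("clean   :", objective(clean, target))
print("attacked:", objective(attacked, target))
```

With the clean step the distance to the optimum shrinks by a constant factor each iteration. With the flipped step it grows by a constant factor, so the objective gets steadily worse, which matches the deterioration described above.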
Major comments
Citation
Why RL agents need robustness
- In fact, data corruption can be a larger threat in the RL paradigm than in traditional supervised learning, because supervised learning is often applied in a controlled environment where data are collected and cleaned by highly-skilled data scientists and domain experts, whereas RL agents are developed to learn in the wild using raw feedback from the environment.
- ref: Zhang, Xuezhou, et al. “Robust policy gradient against strong data corruption.” International Conference on Machine Learning. PMLR, 2021.
Potential future work
we can use the explanation