[TOC]
- Title: Robust Policy Gradient Against Strong Data Corruption
- Author: Xuezhou Zhang et al.
- Publish Year: 2021
- Review Date: Tue, Dec 27, 2022
Summary of paper
Abstract
Contribution
- the authors utilise an SVD-based denoising technique to identify and remove possible reward perturbations (a hedged sketch of such a filtering step is given below)
- this filtering step yields a robust RL algorithm
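A minimal NumPy sketch of what an SVD-based filtering/denoising step could look like; the function name `svd_filter`, the median-based thresholding rule, and all parameters are illustrative assumptions, not the paper's actual algorithm.

```python
import numpy as np

def svd_filter(samples, threshold=3.0, max_iter=10):
    """Hypothetical SVD-based filtering sketch.

    `samples` is an (n, d) array of per-episode estimates (e.g. rewards or
    gradients). Samples whose projection onto the top singular direction of
    the centered data lies far outside the bulk are treated as corrupted
    and dropped; the mean of the surviving samples is returned.
    """
    kept = samples.copy()
    for _ in range(max_iter):
        centered = kept - kept.mean(axis=0)
        # top right-singular vector = direction of largest variance,
        # where adversarial perturbations tend to concentrate
        _, _, vt = np.linalg.svd(centered, full_matrices=False)
        proj = centered @ vt[0]
        # flag samples that deviate strongly from the median along that direction
        scale = np.median(np.abs(proj - np.median(proj))) + 1e-8
        mask = np.abs(proj - np.median(proj)) <= threshold * scale
        if mask.all():
            break
        kept = kept[mask]
    return kept.mean(axis=0)  # robustified estimate
```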
Limitation
- This approach only handles attack perturbations that are not consistent (i.e. not stealthy)
Some key terms
Policy gradient methods
- a popular class of RL methods among practitioners, as they are amenable to parametric policy classes and resilient to mismatches in modelling assumptions
Practicability of the existing works on robust RL
- the majority of these works focus on the tabular MDP setting and cannot be applied to real-world RL problems that have large state and action spaces and require function approximation.
Policy gradient methods can be viewed as a stochastic gradient ascent method
- Conceptually, policy gradient methods can be viewed as stochastic gradient ascent, where each iteration can be simplified as $\theta^{(t+1)} = \theta^{(t)} + g^{(t)}$
- where $g^{(t)}$ is the gradient step that ideally points in the direction of fastest policy improvement. Assuming that $g^{(t)}$ is a good estimate of the gradient direction, a simple attack strategy is to perturb $g^{(t)}$ to point in the $-g^{(t)}$ direction, in which case the policy, rather than improving, will deteriorate as learning proceeds (a toy sketch of this attack is given below).
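A toy sketch of the simplified gradient-ascent update and the gradient-flipping attack described above; `attacked_update`, the learning rate `lr`, and the example vectors are hypothetical names and values used only for illustration.

```python
import numpy as np

def attacked_update(theta, g, lr=0.1, attacked=False):
    """One simplified policy-gradient ascent step: theta <- theta + lr * g.

    When `attacked` is True, the adversary replaces g^{(t)} with -g^{(t)},
    so the update descends the objective and the policy deteriorates
    instead of improving.
    """
    step = -g if attacked else g
    return theta + lr * step

theta = np.zeros(4)
g = np.array([1.0, -0.5, 0.2, 0.0])             # pretend gradient estimate
print(attacked_update(theta, g))                 # moves along +g (improvement)
print(attacked_update(theta, g, attacked=True))  # moves along -g (deterioration)
```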
Major comments
Citation
Why RL agents need robustness
- In fact, data corruption can be a larger threat in the RL paradigm than in traditional supervised learning, because supervised learning is often applied in a controlled environment where data are collected and cleaned by highly skilled data scientists and domain experts, whereas RL agents are developed to learn in the wild using raw feedback from the environment
- ref: Zhang, Xuezhou, et al. "Robust Policy Gradient Against Strong Data Corruption." International Conference on Machine Learning. PMLR, 2021.
Potential future work
we can use the explanation