[TOC]
- Title: Disturbing Reinforcement Learning Agents With Corrupted Rewards
- Author: Rubén Majadas et al.
- Publish Year: Feb 2021
- Review Date: Sat, Dec 17, 2022
Summary of paper
Motivation
- recent works have shown that the performance of RL algorithms degrades under even small changes to the reward function.
- However, little work has been done on how sensitive the learning process is to these disturbances, depending on the aggressiveness of the attack and the exploration strategy used during learning.
- the paper focuses on a subclass of MDPs: episodic, stochastic, goal-only reward MDPs
Contribution
- it demonstrated that smoothly crafted adversarial rewards can mislead the learner
- policies learned with low exploration probability values are more robust to corrupted rewards.
- (though this conclusion seems valid only for the proposed experimental setting)
- the agent is completely lost with attack probabilities higher than p = 0.4
Some key terms
deterministic goal-only reward MDP
- a reward is only given if a goal state is reached.
- many decision tasks are modelled with goal-only rewards, including classical control problems such as Mountain Car or Cart Pole (a minimal sketch is given below)
- the goal-only reward setting is a special case of the sparse reward setting.
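As a concrete illustration of a goal-only reward (not code from the paper; the function name and the goal threshold of 0.5, the standard Mountain Car goal position, are my own assumptions):

```python
def goal_only_reward(next_state, goal_position=0.5):
    """Return a non-zero reward only when the goal state is reached.

    For a Mountain Car-style task, `next_state` is (position, velocity);
    every transition that does not reach `goal_position` yields 0,
    which is what makes the reward signal sparse.
    """
    position, _velocity = next_state
    return 1.0 if position >= goal_position else 0.0
```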
goal of adversary
- produce the maximum deterioration in the learner's policy while performing the minimum number of attacks (one possible formalization is sketched below).
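One possible formalization of this objective, consistent with the sign-flip attack discussed under the limitations below (the notation here is mine, not the paper's): the adversary picks attack indicators that corrupt the reward the learner observes, subject to a budget on how often it attacks.

```latex
\tilde r_t = (1 - 2a_t)\, r_t, \qquad a_t \in \{0, 1\}
\quad \text{(an attack flips the sign of the true reward)}

\min_{a_{1:T}} \; J\big(\pi_{\text{learned from } \tilde r}\big)
\quad \text{s.t.} \quad \sum_{t=1}^{T} a_t \le B
```

where J(π) is the true (uncorrupted) return of the policy the agent ends up learning, and B is the attack budget.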
Major comments
citation
- attacks on the reward function have received very little attention
- ref: Majadas, Rubén, Javier García, and Fernando Fernández. “Disturbing reinforcement learning agents with corrupted rewards.” arXiv preprint arXiv:2102.06587 (2021).
- ref: Wang, Jingkang, Yang Liu, and Bo Li. “Reinforcement learning with perturbed rewards.” arXiv preprint arXiv:1810.01032 (2018).
- there are no studies of how sensitive the learning process is to the aggressiveness of reward perturbations and the choice of exploration strategy.
- ref: Majadas, Rubén, Javier García, and Fernando Fernández. “Disturbing reinforcement learning agents with corrupted rewards.” arXiv preprint arXiv:2102.06587 (2021).
limitation on the setting
- the paper considers only small, low-dimensional state spaces
- it lacks consideration of sparse-reward settings with high-dimensional, large state spaces
- the perturbation is purely probability-based and lacks a focus on false-positive rewards.
- the perturbation is limited to flipping the sign of the true rewards (a minimal sketch of this attack follows this list).
- A fixed attack probability was used in the experiments to test how reward attacks affect agents with varying exploration rates.
- in effect, this experiment punishes the good movements that lead the agent to the goal.
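As a rough illustration of the attack model described above (a minimal sketch with my own naming, not the authors' implementation), a fixed-probability sign-flip corruption could be applied to the reward the learner observes:

```python
import random


class SignFlipAttack:
    """Flip the sign of the true reward with a fixed attack probability.

    Mirrors the setting in these notes: the adversary only changes the
    sign of the reward, and does so with probability `attack_prob`
    (the notes report the agent becoming completely lost for values
    above 0.4).
    """

    def __init__(self, attack_prob=0.4, seed=None):
        self.attack_prob = attack_prob
        self.rng = random.Random(seed)

    def corrupt(self, reward):
        # With probability `attack_prob`, hand the learner the sign-flipped reward.
        if self.rng.random() < self.attack_prob:
            return -reward
        return reward


# Usage: the learner updates on the corrupted reward, while the true
# reward is kept aside for evaluating the learned policy.
attacker = SignFlipAttack(attack_prob=0.4, seed=0)
observed_reward = attacker.corrupt(1.0)  # either 1.0 or -1.0
```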