Perturbed Rewards

Tom_everitt Reinforcement Learning With a Corrupted Reward Channel 2017

[TOC] Title: Reinforcement Learning With a Corrupted Reward Channel Author: Tom Everitt Publish Year: August 22, 2017 Review Date: Mon, Dec 26, 2022 Summary of paper Motivation we formalise this problem as a generalised Markov Decision Problem called Corrupt Reward MDP Traditional RL methods fare poorly in CRMDPs, even under strong simplifying assumptions and when trying to compensate for the possibly corrupt rewards Contribution two ways around the problem are investigated. First, by giving the agent richer data, such as in inverse reinforcement learning and semi-supervised reinforcement learning, reward corruption stemming from systematic sensory errors may sometimes be completely managed second, by using randomisation to blunt the agent’s optimisation, reward corruption can be partially managed under some assumption Limitation ...

Vincent_zhuang No Regret Reinforcement Learning With Heavy Tailed Rewards 2021

[TOC] Title: No-Regret Reinforcement Learning With Heavy Tailed Rewards Author: Vincent Zhuang et. al. Publish Year: 2021 Review Date: Sun, Dec 25, 2022 Summary of paper Motivation To the best of our knowledge, no prior work has considered our setting of heavy-tailed rewards in the MDP setting. Contribution We demonstrate that robust mean estimation techniques can be broadly applied to reinforcement learning algorithms (specifically confidence-based methods) in order to provably han- dle the heavy-tailed reward setting Some key terms Robust UCB algorithm ...

Wenshuai_zhao Towards Closing the Sim to Real Gap in Collaborative Multi Robot Deep Reinforcement Learning 2020

[TOC] Title: Towards Closing the Sim to Real Gap in Collaborative Multi Robot Deep Reinforcement Learning Author: Wenshuai Zhao et. al. Publish Year: 2020 Review Date: Sun, Dec 25, 2022 Summary of paper Motivation we introduce the effect of sensing, calibration, and accuracy mismatches in distributed reinforcement learning we discuss on how both the different types of perturbations and how the number of agents experiencing those perturbations affect the collaborative learning effort Contribution This is, to the best of our knowledge, the first work exploring the limitation of PPO in multi-robot systems when considering that different robots might be exposed to different environment where their sensors or actuators have induced errors ...

Jan_corazza Reinforcement Learning With Stochastic Reward Machines 2022

[TOC] Title: Reinforcement Learning With Stochastic Reward Machines Author: Jan Corazza et. al. Publish Year: AAAI 2022 Review Date: Sat, Dec 24, 2022 Summary of paper Motivation reward machines are an established tool for dealing with reinforcement learning problems in which rewards are sparse and depend on complex sequence of actions. However, existing algorithms for learning reward machines assume an overly idealized setting where rewards have to be free of noise. to overcome this practical limitation, we introduce a novel type of reward machines called stochastic reward machines, and an algorithm for learning them. Contribution Discussing the handling of noisy reward for non-markovian reward function. limitation: the solution introduces multiple sub value function models, which is different from the standard RL algorithm. The work does not emphasise on the sample efficiency of the algorithm. Some key terms Reward machine ...

Oguzhan_dogru Reinforcement Learning With Constrained Uncertain Reward Function Through Particle Filtering 2022

[TOC] Title: Reinforcement Learning With Constrained Uncertain Reward Function Through Particle Filtering Author: Oguzhan Dogru et. al. Publish Year: July 2022 Review Date: Sat, Dec 24, 2022 Summary of paper Motivation this study consider a type of uncertainty, which is caused by the sensor that are utilised for reward function. When the noise is Gaussian and the system is linear Contribution this work used “particle filtering” technique to estimate the true reward function from the perturbed discrete reward sampling points. Some key terms Good things about the paper (one paragraph) Major comments Citation ...

Inaam_ilahi Challenges and Countermeasures for Adversarial Attacks on Reinforcement Learning 2022

[TOC] Title: Challenges and Countermeasures for Adversarial Attacks on Reinforcement Learning Author: Inaam Ilahi et. al. Publish Year: 13 Sep 2021 Review Date: Sat, Dec 24, 2022 Summary of paper Motivation DRL is susceptible to adversarial attacks, which precludes its use in real-life critical system and applications. Therefore, we provide a comprehensive survey that discusses emerging attacks on DRL-based system and the potential countermeasures to defend against these attacks. Contribution we provide the DRL fundamentals along with a non-exhaustive taxonomy of advanced DRL algorithms we present a comprehensive survey of adversarial attacks on DRL and their potential countermeasures we discuss the available benchmarks and metrics for the robustness of DRL finally, we highlight the open issues and research challenges in the robustness of DRL and introduce some potential research directions . Some key terms organisation of this article ...

Zuxin_liu on the Robustness of Safe Reinforcement Learning Under Observational Perturbations 2022

[TOC] Title: On the Robustness of Safe Reinforcement Learning Under Observational Perturbations Author: Zuxin Liu et. al. Publish Year: 3 Oct 2022 Review Date: Thu, Dec 22, 2022 Summary of paper Motivation While many recent safe RL methods with deep policies can achieve outstanding constraint satisfaction in noise-free simulation environment, such a concern regarding their vulnerability under adversarial perturbation has not been studies in the safe RL setting. Contribution we are the first to formally analyze the unique vulnerability of the optimal policy in safe RL under observational corruptions. We define the state-adversarial safe RL problem and investigate its fundamental properties. We show that optimal solutions of safe RL problems are theoretically vulnerable under observational adversarial attacks we show that existing adversarial attack algorithms focusing on minimizing agent rewards do not always work, and propose two effective attack algorithms with theoretical justifications – one directly maximise the constraint violation cost, and one maximise the task reward to induce a tempting but risky policy. Surprisingly, the maximum reward attack is very strong in inducing unsafe behaviors, both in theory and practice we propose an adversarial training algorithm with the proposed attackers and show contraction properties of their Bellman operators. Extensive experiments in continuous control tasks show that our method is more robust against adversarial perturbations in terms of constraint satisfaction. Some key terms Safe reinforcement learning definition ...

Ruben_majadas Disturbing Reinforcement Learning Agents With Corrupted Rewards 2021

[TOC] Title: Disturbing Reinforcement Learning Agents With Corrupted Rewards Author: Ruben Majadas et. al. Publish Year: Feb 2021 Review Date: Sat, Dec 17, 2022 Summary of paper Motivation recent works have shown how the performance of RL algorithm decreases under the influence of soft changes in the reward function. However, little work has been done about how sensitive these disturbances are depending on the aggressiveness of the attack and the learning learning exploration strategy. it chooses a subclass of MDPs: episodic, stochastic goal-only rewards MDPs Contribution it demonstrated that smoothly crafting adversarial rewards are able to mislead the learner the policy that is learned using low exploration probability values is more robust to corrupt rewards. (though this conclusion seems valid only for the proposed experiment setting) the agent is completely lost with attack probabilities higher that than p=0.4 Some key terms deterministic goal only reward MDP ...

Jingkang_wang Reinforcement Learning With Perturbed Rewards 2020

[TOC] Title: Reinforcement Learning With Perturbed Rewards Author: Jingkang Wang et. al. Publish Year: 1 Feb 2020 Review Date: Fri, Dec 16, 2022 Summary of paper Motivation this paper studies RL with perturbed rewards, where a technical challenge is to revert the perturbation process so that the right policy is learned. Some experiments are used to support the algorithm (i.e., estimate the confusion matrix and revert) using existing techniques from the supervised learning (and crowdsourcing) literature. Limitation reviewers had concerns over the scope / significance of this work, mostly about how the confusion matrix is learned. If this matrix is known, correcting reward perturbation is easy, and standard RL can be applied to the corrected rewards. Specifically, the work seems to be limited in two substantial ways, both related to how confusion matrix is learned the reward function needs to be deterministic majority voting requires the number of states to be finite the significance of this work is therefore limited to finite-state problems with deterministic rewards, which is quite restricted. overall, the setting studied here, together with a thorough treatment of an (even restricted) case, could make an interesting paper that inspires future work. However, the exact problem setting is not completely clear in the paper, and the limitation of the technical contribution is somewhat unclear. Contribution The SOTA PPO algorithm is able to obtain 84.6% and 80.8% improvements on average score for five Atari games, with error rates as 10% and 30% respectively Some key terms reward function is often perturbed ...