Posts

Chloe_ching_yun_hsu Revisiting Design Choices in Proximal Policy Optimisation 2020

Title: Revisiting Design Choices in Proximal Policy Optimisation Author: Chloe Ching-Yun Hsu et al. Publish Year: 23 Sep 2020 Review Date: Wed, Dec 28, 2022 Summary of paper Motivation Contribution On discrete action spaces with sparse high rewards, standard PPO often gets stuck at suboptimal actions. We analyze the reasons for these failure modes and explain why they are not exposed by standard benchmarks. In summary, our study suggests that Beta policy parameterization and KL-regularized objectives should be reconsidered for PPO, especially when the alternatives improve PPO in all settings. The authors proved a convergence guarantee for the KL-penalty version of PPO, as it inherits the convergence guarantees of mirror descent for policy families that are closed under mixture. Some key terms design choices ...

James_queeney Generalized Proximal Policy Optimisation With Sample Reuse 2021

Title: Generalized Proximal Policy Optimisation With Sample Reuse 2021 Author: James Queeney et al. Publish Year: 29 Oct 2021 Review Date: Wed, Dec 28, 2022 Summary of paper Motivation It is critical for data-driven reinforcement learning methods to be both stable and sample efficient. On-policy methods typically generate reliable policy improvement throughout training, while off-policy methods make more efficient use of data through sample reuse. Contribution In this work, we combine the theoretically supported stability benefits of on-policy algorithms with the sample efficiency of off-policy algorithms. We develop policy improvement guarantees that are suitable for the off-policy setting, and connect these bounds to the clipping mechanism used in PPO. This motivates an off-policy version of the popular algorithm that we call GePPO. We demonstrate both theoretically and empirically that our algorithm delivers improved performance by effectively balancing the competing goals of stability and sample efficiency. Some key terms sample complexity ...

Lun_wang Backdoorl Backdoor Attack Against Competitive Reinforcement Learning 2021

Title: BackdooRL Backdoor Attack Against Competitive Reinforcement Learning 2021 Author: Lun Wang et al. Publish Year: 12 Dec 2021 Review Date: Wed, Dec 28, 2022 Summary of paper Motivation In this paper, we propose BACKDOORL, a backdoor attack targeted at two-player competitive reinforcement learning systems. First, the adversary agent has to lead the victim to take a series of wrong actions, instead of only one, to prevent it from winning. Additionally, the adversary wants to exhibit the trigger action in as few steps as possible to avoid detection. Contribution We propose BACKDOORL, the first backdoor attack targeted at competitive reinforcement learning systems. The trigger is the action of another agent in the environment. We propose a unified method to design fast-failing agents for different environments. We prototype BACKDOORL and evaluate it in four environments. The results validate the feasibility of backdoor attacks in competitive environments. We study the possible defenses for BACKDOORL. The results show that fine-tuning cannot completely remove the backdoor. Some key terms BACKDOORL workflow ...

Sandy_huang Adversarial Attacks on Neural Network Policies 2017

Title: Adversarial Attacks on Neural Network Policies Author: Sandy Huang et al. Publish Year: 8 Feb 2017 Review Date: Wed, Dec 28, 2022 Summary of paper Motivation In this work, we show adversarial attacks are also effective when targeting neural network policies in reinforcement learning. Specifically, we show existing adversarial example crafting techniques can be used to significantly degrade test-time performance of trained policies. Contribution We characterise the degree of vulnerability across tasks and training algorithms, for a subclass of adversarial example attacks in white-box and black-box settings. ...

Yinglun_xu Efficient Reward Poisoning Attacks on Online Deep Reinforcement Learning 2022

Title: Efficient Reward Poisoning Attacks on Online Deep Reinforcement Learning Author: Yinglun Xu et al. Publish Year: 30 May 2022 Review Date: Tue, Dec 27, 2022 Summary of paper Motivation We study data poisoning attacks on online deep reinforcement learning (DRL) where the attacker is oblivious to the learning algorithm used by the agent and does not necessarily have full knowledge of the environment. We instantiate our framework to construct several attacks which only corrupt the rewards for a small fraction of the total training timesteps and make the agent learn a low-performing policy. Contribution Results show that the reward attacks efficiently poison agent learning with a variety of SOTA DRL algorithms such as DQN and PPO. Our attack can work on model-free DRL algorithms for all popular learning paradigms, and only assumes the learning algorithm to be efficient. A large enough reward poisoning attack in the right direction is able to disrupt the DRL algorithm. Limitation ...

Young_wu Reward Poisoning Attacks on Offline Multi Agent Reinforcement Learning 2022

Title: Reward Poisoning Attacks on Offline Multi Agent Reinforcement Learning Author: Young Wu et al. Publish Year: 1 Dec 2022 Review Date: Tue, Dec 27, 2022 Summary of paper Motivation Contribution Unlike attacks on single-agent RL, we show that the attacker can install the target policy as a Markov Perfect Dominant Strategy Equilibrium (MPDSE), which rational agents are guaranteed to follow. This attack can be significantly cheaper than separate single-agent attacks. Limitation ...

Xuezhou_zhang Robust Policy Gradient Against Strong Data Corruption 2021

Title: Robust Policy Gradient Against Strong Data Corruption Author: Xuezhou Zhang et al. Publish Year: 2021 Review Date: Tue, Dec 27, 2022 Summary of paper Abstract Contribution The authors utilised an SVD-denoising technique to identify and remove possible reward perturbations. This approach gives a robust RL algorithm. Limitation This approach only handles attack perturbations that are not consistent (i.e. not stealthy). Some key terms Policy gradient methods ...

Kiarash_banihashem Defense Against Reward Poisoning Attacks in Reinforcement Learning 2021

Title: Defense Against Reward Poisoning Attacks in Reinforcement Learning Author: Kiarash Banihashem et al. Publish Year: 20 Jun 2021 Review Date: Tue, Dec 27, 2022 Summary of paper Motivation Our goal is to design agents that are robust against such attacks in terms of the worst-case utility w.r.t. the true unpoisoned rewards while computing their policies under the poisoned rewards. Contribution We formalise this reasoning and characterize the utility of our novel framework for designing defense policies. In summary, the key contributions include ...

Amin_rakhsha Reward Poisoning in Reinforcement Learning Attacks Against Unknown Learners in Unknown Environments 2021

Title: Reward Poisoning in Reinforcement Learning Attacks Against Unknown Learners in Unknown Environments Author: Amin Rakhsha et al. Publish Year: 16 Feb 2021 Review Date: Tue, Dec 27, 2022 Summary of paper Motivation Our attack makes minimum assumptions on the prior knowledge of the environment or the learner’s learning algorithm. Most of the prior work makes strong assumptions on the knowledge of the adversary – it is often assumed that the adversary has full knowledge of the environment or the agent’s learning algorithm or both. Under such assumptions, attack strategies have been proposed that can mislead the agent to learn a nefarious policy with minimal perturbation to the rewards. Contribution We design a novel black-box attack, U2, that can provably achieve a near-matching performance to the SOTA white-box attack, demonstrating the feasibility of reward poisoning even in the most challenging black-box setting. Limitation ...

Xuezhou_zhang Adaptive Reward Poisoning Attacks Against Reinforcement Learning 2020

Title: Adaptive Reward Poisoning Attacks Against Reinforcement Learning Author: Xuezhou Zhang et al. Publish Year: 22 Jun, 2020 Review Date: Tue, Dec 27, 2022 Summary of paper Motivation Non-adaptive attacks have been the focus of prior works. However, we show that under mild conditions, adaptive attacks can achieve the nefarious policy in steps polynomial in state-space size $|S|$ whereas non-adaptive attacks require exponential steps. Contribution We provide a lower threshold below which a reward-poisoning attack is infeasible and RL is certified to be safe. Similar to this paper, it shows that reward attacks have their limit. We provide a corresponding upper threshold above which the attack is feasible. We characterise conditions under which such attacks are guaranteed to fail (thus RL is safe), and vice versa. In the case where the attack is feasible, we provide upper bounds on the attack cost in the process of achieving the bad policy. We show that effective attacks can be found empirically using deep RL techniques. Some key terms feasible attack category ...

Anindya_sarkar Reward Delay Attacks on Deep Reinforcement Learning 2022

Title: Reward Delay Attacks on Deep Reinforcement Learning Author: Anindya Sarkar et al. Publish Year: 8 Sep 2022 Review Date: Mon, Dec 26, 2022 Summary of paper Motivation We present novel attacks targeting Q-learning that exploit a vulnerability entailed by this assumption by delaying the reward signal for a limited time period. We evaluate the efficacy of the proposed attacks through a series of experiments. Contribution Our first observation is that reward-delay attacks are extremely effective when the goal of the adversary is simply to minimise reward. We find that some mitigation methods remain insufficient to ensure robustness to attacks that delay, but preserve the order of, rewards. Conclusion ...

Proximal Policy Optimisation Explained Blog

Title: Proximal Policy Optimisation Explained Blog Author: Xiao-Yang Liu; DI engine Publish Year: May 4, 2021 Review Date: Mon, Dec 26, 2022 Highly recommend reading this blog https://lilianweng.github.io/posts/2018-04-08-policy-gradient/ https://zhuanlan.zhihu.com/p/487754664 Difference between on-policy and off-policy On-policy algorithms update the policy network based on the transitions generated by the current policy network. The critic network would make a more accurate value prediction for the current policy network in common environments. Off-policy algorithms allow updating the current policy network using the transitions from old policies. Thus, the old transitions can be reused: as shown in Fig. 1, the points are scattered on trajectories that are generated by different policies, which improves the sample efficiency and reduces the total training steps. Question: is there a way to improve the sample efficiency of on-policy algorithms without losing their benefit? PPO addresses the problem of sample efficiency by utilizing surrogate objectives that keep the new policy from changing too far from the old policy. The surrogate objective is the key feature of PPO since it both (1) regularizes the policy update and (2) enables the reuse of training data. Algorithm ...
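To make the surrogate objective mentioned in the excerpt above concrete, here is a minimal sketch of the widely used clipped form of the PPO surrogate, written in plain Python. The function name `clipped_surrogate`, its arguments, and the default `epsilon=0.2` are illustrative assumptions rather than anything taken from the reviewed blog. The probability ratio between new and old policy is what allows transitions collected by the old policy to be reused, and the clipping is what keeps the update from moving too far.

```python
def clipped_surrogate(ratios, advantages, epsilon=0.2):
    """Mean clipped surrogate objective (a quantity to be maximised).

    ratios[i]     : pi_new(a_i | s_i) / pi_old(a_i | s_i) for transition i
    advantages[i] : advantage estimate for the same transition
    epsilon       : clip range; 0.2 is only an assumed, commonly used default
    """
    if not ratios or len(ratios) != len(advantages):
        raise ValueError("ratios and advantages must be non-empty and equal in length")

    total = 0.0
    for ratio, adv in zip(ratios, advantages):
        # Restrict the ratio to [1 - epsilon, 1 + epsilon].
        clipped_ratio = min(max(ratio, 1.0 - epsilon), 1.0 + epsilon)
        # Taking the minimum gives a pessimistic bound: the objective cannot
        # be increased by pushing the ratio beyond the clip range.
        total += min(ratio * adv, clipped_ratio * adv)
    return total / len(ratios)


if __name__ == "__main__":
    # Two reused transitions: one with positive, one with negative advantage.
    print(clipped_surrogate([1.5, 0.5], [1.0, -1.0]))
```

In a real training loop the same batch of transitions would be passed through this objective for several gradient steps, which is where the sample reuse comes from.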

Tom_everitt Reinforcement Learning With a Corrupted Reward Channel 2017

Title: Reinforcement Learning With a Corrupted Reward Channel Author: Tom Everitt Publish Year: August 22, 2017 Review Date: Mon, Dec 26, 2022 Summary of paper Motivation We formalise this problem as a generalised Markov Decision Problem called Corrupt Reward MDP (CRMDP). Traditional RL methods fare poorly in CRMDPs, even under strong simplifying assumptions and when trying to compensate for the possibly corrupt rewards. Contribution Two ways around the problem are investigated. First, by giving the agent richer data, such as in inverse reinforcement learning and semi-supervised reinforcement learning, reward corruption stemming from systematic sensory errors may sometimes be completely managed. Second, by using randomisation to blunt the agent’s optimisation, reward corruption can be partially managed under some assumptions. Limitation ...

Yunhan_huang Manipulating Reinforcement Learning Stealthy Attacks on Cost Signals 2020

Title: Manipulating Reinforcement Learning Stealthy Attacks on Cost Signals Deceptive Reinforcement Learning Under Adversarial Manipulations on Cost Signals Author: Yunhan Huang et al. Publish Year: 2020 Review Date: Sun, Dec 25, 2022 Summary of paper Motivation Understand the impact of the falsification of cost signals on the convergence of the Q-learning algorithm. Contribution We show that Q-learning algorithms converge under stealthy attacks and bounded falsifications on cost signals, and that there is a robust region within which the adversarial attacks cannot achieve their objective. The robust region of the cost can be utilised by both the offensive and defensive sides. An RL agent can leverage the robust region to evaluate its robustness to malicious falsification. We provide conditions on the falsified cost which can mislead the agent to learn an adversary’s favoured policy. Some key terms Stealthy Attacks ...

Vincent_zhuang No Regret Reinforcement Learning With Heavy Tailed Rewards 2021

Title: No-Regret Reinforcement Learning With Heavy Tailed Rewards Author: Vincent Zhuang et al. Publish Year: 2021 Review Date: Sun, Dec 25, 2022 Summary of paper Motivation To the best of our knowledge, no prior work has considered our setting of heavy-tailed rewards in the MDP setting. Contribution We demonstrate that robust mean estimation techniques can be broadly applied to reinforcement learning algorithms (specifically confidence-based methods) in order to provably handle the heavy-tailed reward setting. Some key terms Robust UCB algorithm ...
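As a rough illustration of what "robust mean estimation" means in the excerpt above, the sketch below implements median-of-means, one standard robust estimator. It is chosen here only as an example: the excerpt does not say which estimator the paper uses, and the function name `median_of_means` and the `num_blocks` parameter are hypothetical.

```python
import statistics


def median_of_means(samples, num_blocks):
    """Estimate a mean robustly: split the samples into equal-sized blocks,
    average each block, and return the median of the block averages.

    Leftover samples that do not fill a complete block are ignored.
    """
    if not 1 <= num_blocks <= len(samples):
        raise ValueError("num_blocks must be between 1 and len(samples)")

    block_size = len(samples) // num_blocks
    block_means = []
    for b in range(num_blocks):
        block = samples[b * block_size:(b + 1) * block_size]
        block_means.append(sum(block) / len(block))
    # A single extreme reward can distort at most one block mean,
    # and the median discards that block.
    return statistics.median(block_means)


if __name__ == "__main__":
    rewards = [1.0, 1.0, 1.0, 1.0, 1.0, 1000.0]  # one heavy-tailed outlier
    print(sum(rewards) / len(rewards))   # the plain mean is dominated by the outlier
    print(median_of_means(rewards, 3))   # the robust estimate is not
```

A confidence-based method would plug such an estimate into its reward statistics in place of the empirical mean; how exactly that is done in the paper is not covered by this sketch.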

Wenshuai_zhao Towards Closing the Sim to Real Gap in Collaborative Multi Robot Deep Reinforcement Learning 2020

Title: Towards Closing the Sim to Real Gap in Collaborative Multi Robot Deep Reinforcement Learning Author: Wenshuai Zhao et al. Publish Year: 2020 Review Date: Sun, Dec 25, 2022 Summary of paper Motivation We introduce the effect of sensing, calibration, and accuracy mismatches in distributed reinforcement learning. We discuss how both the different types of perturbations and the number of agents experiencing those perturbations affect the collaborative learning effort. Contribution This is, to the best of our knowledge, the first work exploring the limitations of PPO in multi-robot systems when considering that different robots might be exposed to different environments where their sensors or actuators have induced errors ...

Jan_corazza Reinforcement Learning With Stochastic Reward Machines 2022

Title: Reinforcement Learning With Stochastic Reward Machines Author: Jan Corazza et al. Publish Year: AAAI 2022 Review Date: Sat, Dec 24, 2022 Summary of paper Motivation Reward machines are an established tool for dealing with reinforcement learning problems in which rewards are sparse and depend on complex sequences of actions. However, existing algorithms for learning reward machines assume an overly idealized setting where rewards have to be free of noise. To overcome this practical limitation, we introduce a novel type of reward machine called stochastic reward machines, and an algorithm for learning them. Contribution Discusses the handling of noisy rewards for non-Markovian reward functions. Limitation: the solution introduces multiple sub value function models, which is different from the standard RL algorithm. The work does not emphasise the sample efficiency of the algorithm. Some key terms Reward machine ...

Oguzhan_dogru Reinforcement Learning With Constrained Uncertain Reward Function Through Particle Filtering 2022

Title: Reinforcement Learning With Constrained Uncertain Reward Function Through Particle Filtering Author: Oguzhan Dogru et al. Publish Year: July 2022 Review Date: Sat, Dec 24, 2022 Summary of paper Motivation This study considers a type of uncertainty caused by the sensors that are utilised for the reward function. When the noise is Gaussian and the system is linear Contribution This work used a “particle filtering” technique to estimate the true reward function from the perturbed discrete reward sampling points. Some key terms Good things about the paper (one paragraph) Major comments Citation ...

Inaam_ilahi Challenges and Countermeasures for Adversarial Attacks on Reinforcement Learning 2022

Title: Challenges and Countermeasures for Adversarial Attacks on Reinforcement Learning Author: Inaam Ilahi et al. Publish Year: 13 Sep 2021 Review Date: Sat, Dec 24, 2022 Summary of paper Motivation DRL is susceptible to adversarial attacks, which precludes its use in real-life critical systems and applications. Therefore, we provide a comprehensive survey that discusses emerging attacks on DRL-based systems and the potential countermeasures to defend against these attacks. Contribution We provide the DRL fundamentals along with a non-exhaustive taxonomy of advanced DRL algorithms. We present a comprehensive survey of adversarial attacks on DRL and their potential countermeasures. We discuss the available benchmarks and metrics for the robustness of DRL. Finally, we highlight the open issues and research challenges in the robustness of DRL and introduce some potential research directions. Some key terms organisation of this article ...

Zuxin_liu On the Robustness of Safe Reinforcement Learning Under Observational Perturbations 2022

Title: On the Robustness of Safe Reinforcement Learning Under Observational Perturbations Author: Zuxin Liu et al. Publish Year: 3 Oct 2022 Review Date: Thu, Dec 22, 2022 Summary of paper Motivation While many recent safe RL methods with deep policies can achieve outstanding constraint satisfaction in noise-free simulation environments, the concern regarding their vulnerability under adversarial perturbation has not been studied in the safe RL setting. Contribution We are the first to formally analyze the unique vulnerability of the optimal policy in safe RL under observational corruptions. We define the state-adversarial safe RL problem and investigate its fundamental properties. We show that optimal solutions of safe RL problems are theoretically vulnerable under observational adversarial attacks. We show that existing adversarial attack algorithms focusing on minimizing agent rewards do not always work, and propose two effective attack algorithms with theoretical justifications – one directly maximises the constraint violation cost, and one maximises the task reward to induce a tempting but risky policy. Surprisingly, the maximum reward attack is very strong in inducing unsafe behaviors, both in theory and practice. We propose an adversarial training algorithm with the proposed attackers and show contraction properties of their Bellman operators. Extensive experiments in continuous control tasks show that our method is more robust against adversarial perturbations in terms of constraint satisfaction. Some key terms Safe reinforcement learning definition ...

Ruben_majadas Disturbing Reinforcement Learning Agents With Corrupted Rewards 2021

Title: Disturbing Reinforcement Learning Agents With Corrupted Rewards Author: Ruben Majadas et al. Publish Year: Feb 2021 Review Date: Sat, Dec 17, 2022 Summary of paper Motivation Recent works have shown how the performance of RL algorithms decreases under the influence of soft changes in the reward function. However, little work has been done about how sensitive these disturbances are depending on the aggressiveness of the attack and the learning exploration strategy. It chooses a subclass of MDPs: episodic, stochastic goal-only rewards MDPs. Contribution It demonstrated that smoothly crafted adversarial rewards are able to mislead the learner. The policy that is learned using low exploration probability values is more robust to corrupt rewards (though this conclusion seems valid only for the proposed experiment setting). The agent is completely lost with attack probabilities higher than p=0.4. Some key terms deterministic goal only reward MDP ...

Jingkang_wang Reinforcement Learning With Perturbed Rewards 2020

Title: Reinforcement Learning With Perturbed Rewards Author: Jingkang Wang et al. Publish Year: 1 Feb 2020 Review Date: Fri, Dec 16, 2022 Summary of paper Motivation This paper studies RL with perturbed rewards, where a technical challenge is to revert the perturbation process so that the right policy is learned. Some experiments are used to support the algorithm (i.e., estimate the confusion matrix and revert) using existing techniques from the supervised learning (and crowdsourcing) literature. Limitation Reviewers had concerns over the scope / significance of this work, mostly about how the confusion matrix is learned. If this matrix is known, correcting reward perturbation is easy, and standard RL can be applied to the corrected rewards. Specifically, the work seems to be limited in two substantial ways, both related to how the confusion matrix is learned: the reward function needs to be deterministic, and majority voting requires the number of states to be finite. The significance of this work is therefore limited to finite-state problems with deterministic rewards, which is quite restricted. Overall, the setting studied here, together with a thorough treatment of an (even restricted) case, could make an interesting paper that inspires future work. However, the exact problem setting is not completely clear in the paper, and the limitation of the technical contribution is somewhat unclear. Contribution The SOTA PPO algorithm is able to obtain 84.6% and 80.8% improvements on average score for five Atari games, with error rates of 10% and 30%, respectively. Some key terms reward function is often perturbed ...
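The remark above that correction is easy once the confusion matrix is known can be illustrated with a small sketch for the two-valued reward case. Everything here is an assumption made for illustration: the function name `surrogate_rewards`, the flip-rate parameters `e_minus` and `e_plus`, and the restriction to binary rewards. The idea is to replace each observed reward with a surrogate value chosen so that its expectation under the noise equals the true reward.

```python
def surrogate_rewards(r_minus, r_plus, e_minus, e_plus):
    """Return (hat_minus, hat_plus), the corrected values to use when the
    observed reward is r_minus or r_plus respectively.

    Assumed noise model (a 2x2 confusion matrix):
      e_minus = P(observe r_plus  | true reward is r_minus)
      e_plus  = P(observe r_minus | true reward is r_plus)
    """
    denom = 1.0 - e_minus - e_plus
    if abs(denom) < 1e-12:
        raise ValueError("flip rates sum to 1: observations carry no information")

    hat_plus = ((1.0 - e_minus) * r_plus - e_plus * r_minus) / denom
    hat_minus = ((1.0 - e_plus) * r_minus - e_minus * r_plus) / denom
    return hat_minus, hat_plus


if __name__ == "__main__":
    e_minus, e_plus = 0.1, 0.3
    hat_minus, hat_plus = surrogate_rewards(-1.0, 1.0, e_minus, e_plus)
    # Expected surrogate reward when the true reward is +1 and when it is -1;
    # both should recover the true value up to floating-point error.
    print((1.0 - e_plus) * hat_plus + e_plus * hat_minus)
    print((1.0 - e_minus) * hat_minus + e_minus * hat_plus)
```

A standard RL algorithm can then be trained on the surrogate values instead of the raw observations. Estimating the flip rates when they are unknown is the part the reviewers' concerns are about, and it is not addressed by this sketch.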

Jacob_andreas Language Models as Agent Models 2022

Title: Language Models as Agent Models Author: Jacob Andreas Publish Year: 3 Dec 2022 Review Date: Sat, Dec 10, 2022 https://arxiv.org/pdf/2212.01681.pdf Summary of paper Motivation During training, LMs have access only to the text of the documents, with no direct evidence of the internal states of the human agents that produce them (kind of a hidden MDP thing). This is a fact often used to argue that LMs are incapable of modelling goal-directed aspects of human language production and comprehension. The author stated that even in today’s non-robust and error-prone models – LMs infer and use representations of fine-grained communicative intentions and more abstract beliefs and goals. Despite the limited nature of their training data, they can thus serve as building blocks for systems that communicate and act intentionally. In other words, the author said that a language model can be used to communicate the intentions of a human agent, and hence it can be treated as an agent model. Contribution The author claimed that in the course of performing next-word prediction in context, current LMs sometimes infer approximate, partial representations of the beliefs, desires and intentions possessed by the agent that produced the context, and other agents mentioned within it. Once these representations are inferred, they are causally linked to LM prediction, and thus bear the same relation to generated text that an intentional agent’s state bears to its communicative actions.
The high-level goals of this paper are twofold: first, to outline a specific sense in which idealised language models can function as models of agent beliefs, desires and intentions; second, to highlight a few cases in which existing models appear to approach this idealization (and describe the ways in which they still fall short). Training on text alone produces ready-made models of the map from agent states to text; these models offer a starting point for language processing systems that communicate intentionally. Some key terms Current language model is bad ...

Charlie_snell Context Aware Language Modeling for Goal Oriented Dialogue Systems 2022

Title: Context Aware Language Modeling for Goal Oriented Dialogue Systems Author: Charlie Snell et al. Publish Year: 22 Apr 2022 Review Date: Sun, Nov 20, 2022 Summary of paper Motivation While supervised learning with large language models is capable of producing realistic text, how to steer such responses towards completing a specific task without sacrificing language quality remains an open question. How can we scalably and effectively introduce the mechanisms of goal-directed decision making into end-to-end language models, to steer language generation toward completing specific dialogue tasks rather than simply generating probable responses? They aim to directly finetune language models in a task-aware manner such that they can maximise a given utility function. Contribution It seems like the manipulation of the training dataset and the auxiliary objective are the two main “innovations” of the model. Some key terms Dialogue ...

Sanchit_agarwal Building Goal Oriented Dialogue Systems With Situated Visual Context 2021

Title: Building Goal Oriented Dialogue Systems With Situated Visual Context 2021 Author: Sanchit Agarwal et al. Publish Year: 22 Nov 2021 Review Date: Sun, Nov 20, 2022 Summary of paper Motivation With the surge of virtual assistants with screens, the next generation of agents are required to also understand screen context in order to provide a proper interactive experience, and better understand users’ goals. So in this paper, they propose a novel multimodal conversational framework, where the agent’s next action and its arguments are derived jointly conditioned on the conversational and the visual context. The model can recognise visual features such as color and shape as well as metadata-based features such as price or star rating associated with a visual entity. Contribution They propose a novel multimodal conversational system that considers screen context, in addition to dialogue context, while deciding the agent’s next action. The proposed visual grounding model takes both metadata and images as input, allowing it to reason over metadata and visual information. The solution encodes the user query and each visual entity and then computes the similarity between them. To improve the visual entity encoding, they introduced query-guided attention and entity self-attention layers. They collect the MTurk survey and also create a multimodal dialogue simulator. Architecture ...