Theodore_r_sumers How to Talk So AI Will Learn 2022

[TOC] Title: How to talk so AI will learn: Instructions, descriptions, and autonomy Author: Theodore R. Sumers et al. Publish Year: NeurIPS 2022 Review Date: Wed, Mar 15, 2023 url: https://arxiv.org/pdf/2206.07870.pdf Summary of paper Motivation yet today, we lack computational models explaining such language use Contribution To address this challenge, we formalise learning from language in a contextual bandit setting and ask how a human might communicate preferences over behaviours (i.e., obtaining the intent (preference) from the presentation (behaviour)). We show that instructions are better in low-autonomy settings, but descriptions are better when the agent will need to act independently....

March 15, 2023 · 3 min · 591 words · Sukai Huang

Cheng_chi Diffusion Policy Visuomotor Policy Learning via Action Diffusion 2023

[TOC] Title: Diffusion Policy: Visuomotor Policy Learning via Action Diffusion Author: Cheng Chi et al. Publish Year: 2023 Review Date: Thu, Mar 9, 2023 url: https://diffusion-policy.cs.columbia.edu/diffusion_policy_2023.pdf Summary of paper Contribution introducing a new form of robot visuomotor policy that generates behaviour via a "conditional denoising diffusion process" on the robot action space Some key terms Explicit policy learning: this is like imitation learning. Implicit policy learning: aims to minimise the estimate of an energy function; this is like standard reinforcement learning. Diffusion policy...
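
A minimal sketch of what a conditional denoising step over the action space could look like, assuming a trained noise-prediction network `eps_model(noisy_actions, obs, t)`; the names, shapes, and DDPM schedule here are illustrative, not taken from the paper's released code:

```python
# Minimal DDPM-style sampler for a diffusion policy over action sequences
# (assumption: eps_model predicts the injected noise given noisy actions, obs, step t).
import torch

def ddpm_sample_actions(eps_model, obs, action_dim, horizon, n_steps, betas):
    """Iteratively denoise a random action sequence conditioned on observations."""
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    a = torch.randn(1, horizon, action_dim)           # start from Gaussian noise
    for t in reversed(range(n_steps)):
        eps = eps_model(a, obs, t)                     # predict the injected noise
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        mean = (a - coef * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(a) if t > 0 else torch.zeros_like(a)
        a = mean + torch.sqrt(betas[t]) * noise        # DDPM posterior sample
    return a                                           # denoised action sequence
```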

March 9, 2023 · 1 min · 205 words · Sukai Huang

Tianjun_zhang the Wisdom of Hindsight Makes Language Models Better Instruction Followers 2023

[TOC] Title: The Wisdom of Hindsight Makes Language Models Better Instruction Followers Author: Tianjun Zhang et al. Publish Year: 10 Feb 2023 Review Date: Thu, Mar 2, 2023 url: https://arxiv.org/pdf/2302.05206.pdf Summary of paper Motivation Reinforcement Learning with Human Feedback (RLHF) demonstrates impressive performance on the GPT series models. However, the pipeline for reward and value networks Contribution in this paper, we consider an alternative approach: converting feedback to instruction by relabeling the original one and training the model for better alignment in a supervised manner....
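
The relabeling idea can be sketched roughly as below; `generate`, `satisfies`, and `describe_outcome` are hypothetical helpers standing in for the paper's generation, feedback, and relabeling steps:

```python
# Hedged sketch of hindsight instruction relabeling: if the model's output does not
# satisfy the original instruction, relabel the example with an instruction the output
# *does* satisfy, then fine-tune on the relabeled data in a supervised manner.
def build_relabeled_dataset(model, prompts):
    relabeled = []
    for instruction, query in prompts:
        output = model.generate(instruction, query)
        if satisfies(output, instruction):
            target_instruction = instruction               # keep the original goal
        else:
            target_instruction = describe_outcome(output)  # hindsight relabel
        relabeled.append((target_instruction, query, output))
    return relabeled  # fine-tune on (instruction, query) -> output pairs
```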

March 2, 2023 · 3 min · 427 words · Sukai Huang

Alexander_nikulin Anti Exploration by Random Network Distillation 2023

[TOC] Title: Anti-Exploration by Random Network Distillation Author: Alexander Nikulin et al. Publish Year: 31 Jan 2023 Review Date: Wed, Mar 1, 2023 url: https://arxiv.org/pdf/2301.13616.pdf Summary of paper Motivation despite the success of Random Network Distillation (RND) in various domains, it was shown to be not discriminative enough to be used as an uncertainty estimator for penalizing out-of-distribution actions in offline reinforcement learning ?? wait, why do we want to penalize out-of-distribution actions?...
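
As a reminder of the base mechanism, a minimal sketch of an RND-style penalty for offline RL: the prediction error of a learned predictor against a frozen random target network flags out-of-distribution state-action pairs (layer sizes and the state-action input are assumptions, not the paper's architecture):

```python
# RND-style anti-exploration penalty: large prediction error => looks out-of-distribution.
import torch
import torch.nn as nn

class RNDPenalty(nn.Module):
    def __init__(self, in_dim, feat_dim=64):
        super().__init__()
        self.target = nn.Sequential(nn.Linear(in_dim, feat_dim), nn.ReLU(),
                                    nn.Linear(feat_dim, feat_dim))
        self.predictor = nn.Sequential(nn.Linear(in_dim, feat_dim), nn.ReLU(),
                                       nn.Linear(feat_dim, feat_dim))
        for p in self.target.parameters():             # random target net stays fixed
            p.requires_grad_(False)

    def forward(self, state_action):
        # penalty added to (or subtracted from) the critic target for OOD actions
        return (self.predictor(state_action) - self.target(state_action)).pow(2).mean(-1)
```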

March 1, 2023 · 2 min · 359 words · Sukai Huang

Edoardo_cetin Learning Pessimism for Reinforcement Learning 2023

[TOC] Title: Learning Pessimism for Reinforcement Learning Author: Edoardo Cetin et al. Publish Year: 2023 Review Date: Wed, Mar 1, 2023 url: https://kclpure.kcl.ac.uk/portal/files/196848783/10977.CetinE.pdf Summary of paper Motivation Off-policy deep RL algorithms commonly compensate for overestimation bias during temporal difference learning by utilizing pessimistic estimates of the expected target returns Contribution we propose Generalised Pessimism Learning (GPL), a strategy employing a novel learnable penalty to enact such pessimism. In particular, we propose to learn this penalty alongside the critic with dual TD-learning, a new procedure to estimate and minimise the magnitude of the target returns bias with trivial computational cost....
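
One way to picture the idea, as a rough sketch under assumptions of my own (ensemble disagreement as the penalty, a dual update on the coefficient) rather than the paper's exact formulation: the TD target subtracts a learnable coefficient times the critics' disagreement, and that coefficient is adjusted to push an estimate of the target-return bias toward zero.

```python
# Rough sketch of a learnable pessimism penalty (illustrative, not GPL's exact form).
import torch

def pessimistic_td_target(q1_next, q2_next, reward, done, gamma, log_beta):
    beta = log_beta.exp()
    mean_q = 0.5 * (q1_next + q2_next)
    disagreement = (q1_next - q2_next).abs()           # proxy for epistemic uncertainty
    return reward + gamma * (1.0 - done) * (mean_q - beta * disagreement)

def dual_beta_update(log_beta, bias_estimate, lr=1e-3):
    # positive bias (overestimation) => increase pessimism; negative bias => decrease it
    with torch.no_grad():
        log_beta += lr * bias_estimate
    return log_beta
```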

March 1, 2023 · 2 min · 222 words · Sukai Huang

Danijar_hafner Mastering Diverse Domains Through World Models 2023

[TOC] Title: Mastering Diverse Domains Through World Models Author: Danijar Hafner et al. Publish Year: 10 Jan 2023 Review Date: Tue, Feb 7, 2023 url: https://www.youtube.com/watch?v=vfpZu0R1s1Y Summary of paper Motivation general intelligence requires solving tasks across many domains. Current reinforcement learning algorithms carry this potential but are held back by the resources and knowledge required to tune them for new tasks. Contribution we present DreamerV3, a general and scalable algorithm based on world models that outperforms previous approaches across a wide range of domains with fixed hyperparameters....

February 7, 2023 · 2 min · 291 words · Sukai Huang

Alekh_agarwal PC-PG Policy Cover Directed Exploration for Provable Policy Gradient Learning 2020

[TOC] Title: PC-PG: Policy Cover Directed Exploration for Provable Policy Gradient Learning Author: Alekh Agarwal et al. Publish Year: Review Date: Wed, Dec 28, 2022 Summary of paper Motivation The primary drawback of direct policy gradient methods is that, by being local in nature, they fail to adequately explore the environment. In contrast, model-based approaches and Q-learning directly handle exploration through the use of optimism. Contribution Policy Cover-Policy Gradient algorithm (PC-PG), a direct, model-free policy optimisation approach which addresses exploration through the use of a learned ensemble of policies, which together provide a policy cover over the state space....

December 28, 2022 · 2 min · 271 words · Sukai Huang

Alekh_agarwal on the Theory of Policy Gradient Methods Optimality Approximation and Distribution Shift 2020

[TOC] Title: On the Theory of Policy Gradient Methods: Optimality, Approximation, and Distribution Shift Author: Alekh Agarwal et al. Publish Year: 14 Oct 2020 Review Date: Wed, Dec 28, 2022 Summary of paper Motivation little is known about even their most basic theoretical convergence properties, including if and how fast they converge to a globally optimal solution and how they cope with approximation error due to using a restricted class of parametric policies....

December 28, 2022 · 3 min · 557 words · Sukai Huang

Chloe_ching_yun_hsu Revisiting Design Choices in Proximal Policy Optimisation 2020

[TOC] Title: Revisiting Design Choices in Proximal Policy Optimisation Author: Chloe Ching-Yun Hsu et al. Publish Year: 23 Sep 2020 Review Date: Wed, Dec 28, 2022 Summary of paper Motivation Contribution on discrete action spaces with sparse, high rewards, standard PPO often gets stuck at suboptimal actions. Why: analyse the reasons for these failure modes and explain why they are not exposed by standard benchmarks. In summary, our study suggests that Beta policy parameterization and KL-regularized objectives should be reconsidered for PPO, especially when alternatives improve PPO in all settings....
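
For concreteness, a hedged sketch of the two alternatives being reconsidered, a KL-regularised surrogate loss and a Beta parameterisation for bounded continuous actions (illustrative code, not the paper's implementation):

```python
# Hedged sketch of the two design choices the paper revisits (illustrative only).
import torch
import torch.nn.functional as F
from torch.distributions import Beta

def kl_regularised_surrogate(logp_new, logp_old, advantages, kl_new_old, kl_coef=1.0):
    """PPO-style surrogate with a KL penalty instead of (or alongside) ratio clipping."""
    ratio = torch.exp(logp_new - logp_old)
    return -(ratio * advantages).mean() + kl_coef * kl_new_old.mean()

def beta_policy(actor_out, low=-1.0, high=1.0):
    """Beta parameterisation for actions bounded in [low, high]."""
    raw_a, raw_b = actor_out.chunk(2, dim=-1)
    alpha = 1.0 + F.softplus(raw_a)        # keep alpha, beta > 1 for a unimodal density
    beta = 1.0 + F.softplus(raw_b)
    dist = Beta(alpha, beta)               # samples lie in (0, 1)
    action = low + (high - low) * dist.rsample()   # rescale to the action bounds
    return dist, action
```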

December 28, 2022 · 3 min · 467 words · Sukai Huang

James_queeney Generalized Proximal Policy Optimisation With Sample Reuse 2021

[TOC] Title: Generalized Proximal Policy Optimisation With Sample Reuse Author: James Queeney et al. Publish Year: 29 Oct 2021 Review Date: Wed, Dec 28, 2022 Summary of paper Motivation it is critical for data-driven reinforcement learning methods to be both stable and sample efficient. On-policy methods typically generate reliable policy improvement throughout training, while off-policy methods make more efficient use of data through sample reuse. Contribution in this work, we combine the theoretically supported stability benefits of on-policy algorithms with the sample efficiency of off-policy algorithms....

December 28, 2022 · 5 min · 1033 words · Sukai Huang

Young_wu Reward Poisoning Attacks on Offline Multi Agent Reinforcement Learning 2022

[TOC] Title: Reward Poisoning Attacks on Offline Multi-Agent Reinforcement Learning Author: Young Wu et al. Publish Year: 1 Dec 2022 Review Date: Tue, Dec 27, 2022 Summary of paper Motivation Contribution unlike attacks on single-agent RL, we show that the attacker can install the target policy as a Markov Perfect Dominant Strategy Equilibrium (MPDSE), which rational agents are guaranteed to follow. This attack can be significantly cheaper than separate single-agent attacks....

December 27, 2022 · 1 min · 146 words · Sukai Huang

Kiarash_banihashem Defense Against Reward Poisoning Attacks in Reinforcement Learning 2021

[TOC] Title: Defense Against Reward Poisoning Attacks in Reinforcement Learning Author: Kiarash Banihashem et al. Publish Year: 20 Jun 2021 Review Date: Tue, Dec 27, 2022 Summary of paper Motivation our goal is to design agents that are robust against such attacks in terms of the worst-case utility w.r.t. the true, unpoisoned rewards while computing their policies under the poisoned rewards. Contribution we formalise this reasoning and characterize the utility of our novel framework for designing defense policies....

December 27, 2022 · 2 min · 303 words · Sukai Huang

Amin_rakhsha Reward Poisoning in Reinforcement Learning Attacks Against Unknown Learners in Unknown Environments 2021

[TOC] Title: Reward Poisoning in Reinforcement Learning: Attacks Against Unknown Learners in Unknown Environments Author: Amin Rakhsha et al. Publish Year: 16 Feb 2021 Review Date: Tue, Dec 27, 2022 Summary of paper Motivation Our attack makes minimum assumptions on the prior knowledge of the environment or the learner's learning algorithm. Most prior work makes strong assumptions on the adversary's knowledge: it is often assumed that the adversary has full knowledge of the environment or the agent's learning algorithm, or both....

December 27, 2022 · 2 min · 233 words · Sukai Huang

Xuezhou_zhang Adaptive Reward Poisoning Attacks Against Reinforcement Learning 2020

[TOC] Title: Adaptive Reward Poisoning Attacks Against Reinforcement Learning Author: Xuezhou Zhang et al. Publish Year: 22 Jun 2020 Review Date: Tue, Dec 27, 2022 Summary of paper Motivation Non-adaptive attacks have been the focus of prior works. However, we show that under mild conditions, adaptive attacks can achieve the nefarious policy in steps polynomial in state-space size $|S|$, whereas non-adaptive attacks require exponential steps. Contribution we provide a lower threshold below which reward-poisoning attacks are infeasible and RL is certified to be safe....

December 27, 2022 · 2 min · 283 words · Sukai Huang

Proximal Policy Optimisation Explained Blog

[TOC] Title: Proximal Policy Optimisation Explained Blog Author: Xiao-Yang Liu; DI engine Publish Year: May 4, 2021 Review Date: Mon, Dec 26, 2022 Highly recommend reading these blogs: https://lilianweng.github.io/posts/2018-04-08-policy-gradient/ https://zhuanlan.zhihu.com/p/487754664 Difference between on-policy and off-policy For on-policy algorithms, they update the policy network based on the transitions generated by the current policy network. The critic network would make a more accurate value prediction for the current policy network in common environments. For off-policy algorithms, they can update the current policy network using transitions from old policies....
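
Since the on-policy discussion here ultimately revolves around PPO, a minimal sketch of PPO's clipped surrogate loss for reference (tensor names are illustrative):

```python
# PPO clipped surrogate loss: keep the new policy close to the data-collecting policy.
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    ratio = torch.exp(logp_new - logp_old)                   # pi_new(a|s) / pi_old(a|s)
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```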

December 26, 2022 · 1 min · 196 words · Sukai Huang

Tom_everitt Reinforcement Learning With a Corrupted Reward Channel 2017

[TOC] Title: Reinforcement Learning With a Corrupted Reward Channel Author: Tom Everitt Publish Year: August 22, 2017 Review Date: Mon, Dec 26, 2022 Summary of paper Motivation we formalise this problem as a generalised Markov Decision Problem called Corrupt Reward MDP Traditional RL methods fare poorly in CRMDPs, even under strong simplifying assumptions and when trying to compensate for the possibly corrupt rewards Contribution two ways around the problem are investigated....

December 26, 2022 · 4 min · 757 words · Sukai Huang

Yunhan_huang Manipulating Reinforcement Learning Stealthy Attacks on Cost Signals 2020

[TOC] Title: Manipulating Reinforcement Learning: Stealthy Attacks on Cost Signals (Deceptive Reinforcement Learning Under Adversarial Manipulations on Cost Signals) Author: Yunhan Huang et al. Publish Year: 2020 Review Date: Sun, Dec 25, 2022 Summary of paper Motivation understand the impact of the falsification of cost signals on the convergence of the Q-learning algorithm Contribution we show that Q-learning algorithms converge under stealthy attacks and bounded falsifications on cost signals, and that there is a robust region within which the adversarial attacks cannot achieve their objective....

December 25, 2022 · 2 min · 336 words · Sukai Huang

Vincent_zhuang No Regret Reinforcement Learning With Heavy Tailed Rewards 2021

[TOC] Title: No-Regret Reinforcement Learning With Heavy-Tailed Rewards Author: Vincent Zhuang et al. Publish Year: 2021 Review Date: Sun, Dec 25, 2022 Summary of paper Motivation To the best of our knowledge, no prior work has considered our setting of heavy-tailed rewards in the MDP setting. Contribution We demonstrate that robust mean estimation techniques can be broadly applied to reinforcement learning algorithms (specifically confidence-based methods) in order to provably handle the heavy-tailed reward setting Some key terms Robust UCB algorithm...
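
As a concrete example of the kind of robust mean estimator such confidence-based methods can plug in, a median-of-means sketch (the specific estimator and block count are my illustration, not necessarily the paper's choice):

```python
# Median-of-means: a robust mean estimator that tolerates heavy-tailed samples.
import numpy as np

def median_of_means(samples, n_blocks=8):
    """Split samples into blocks, average each block, take the median of block means."""
    samples = np.asarray(samples, dtype=float)
    blocks = np.array_split(samples, n_blocks)
    return float(np.median([b.mean() for b in blocks]))

# usage: a few extreme outliers barely move the estimate
rewards = np.concatenate([np.random.default_rng(0).normal(1.0, 1.0, 200), [1e4, -1e4]])
print(median_of_means(rewards))
```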

December 25, 2022 · 2 min · 225 words · Sukai Huang

Wenshuai_zhao Towards Closing the Sim to Real Gap in Collaborative Multi Robot Deep Reinforcement Learning 2020

[TOC] Title: Towards Closing the Sim-to-Real Gap in Collaborative Multi-Robot Deep Reinforcement Learning Author: Wenshuai Zhao et al. Publish Year: 2020 Review Date: Sun, Dec 25, 2022 Summary of paper Motivation we introduce the effect of sensing, calibration, and accuracy mismatches in distributed reinforcement learning we discuss how both the different types of perturbations and the number of agents experiencing those perturbations affect the collaborative learning effort Contribution This is, to the best of our knowledge, the first work exploring the limitations of PPO in multi-robot systems when considering that different robots might be exposed to different environments where their sensors or actuators have induced errors

December 25, 2022 · 2 min · 365 words · Sukai Huang

Jan_corazza Reinforcement Learning With Stochastic Reward Machines 2022

[TOC] Title: Reinforcement Learning With Stochastic Reward Machines Author: Jan Corazza et al. Publish Year: AAAI 2022 Review Date: Sat, Dec 24, 2022 Summary of paper Motivation reward machines are an established tool for dealing with reinforcement learning problems in which rewards are sparse and depend on complex sequences of actions. However, existing algorithms for learning reward machines assume an overly idealized setting where rewards have to be free of noise....

December 24, 2022 · 3 min · 465 words · Sukai Huang

Oguzhan_dogru Reinforcement Learning With Constrained Uncertain Reward Function Through Particle Filtering 2022

[TOC] Title: Reinforcement Learning With Constrained Uncertain Reward Function Through Particle Filtering Author: Oguzhan Dogru et al. Publish Year: July 2022 Review Date: Sat, Dec 24, 2022 Summary of paper Motivation this study considers a type of uncertainty caused by the sensors that are utilised for the reward function. When the noise is Gaussian and the system is linear Contribution this work used a "particle filtering" technique to estimate the true reward function from the perturbed discrete reward sampling points....
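
To make the filtering idea concrete, a hedged sketch of a bootstrap particle filter tracking a latent "true" reward from noisy reward observations (the random-walk model and noise scales are illustrative assumptions, not the paper's setup):

```python
# Bootstrap particle filter over a latent reward signal observed through sensor noise.
import numpy as np

def particle_filter_reward(observations, n_particles=500,
                           process_std=0.05, obs_std=0.5, rng=None):
    rng = rng or np.random.default_rng(0)
    particles = rng.normal(0.0, 1.0, n_particles)        # initial guesses of the true reward
    estimates = []
    for y in observations:
        particles += rng.normal(0.0, process_std, n_particles)     # predict (random walk)
        weights = np.exp(-0.5 * ((y - particles) / obs_std) ** 2)  # observation likelihood
        weights /= weights.sum()
        idx = rng.choice(n_particles, n_particles, p=weights)      # resample
        particles = particles[idx]
        estimates.append(particles.mean())                # posterior mean as filtered reward
    return estimates
```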

December 24, 2022 · 2 min · 297 words · Sukai Huang

Inaam_ilahi Challenges and Countermeasures for Adversarial Attacks on Reinforcement Learning 2022

[TOC] Title: Challenges and Countermeasures for Adversarial Attacks on Reinforcement Learning Author: Inaam Ilahi et al. Publish Year: 13 Sep 2021 Review Date: Sat, Dec 24, 2022 Summary of paper Motivation DRL is susceptible to adversarial attacks, which precludes its use in real-life critical systems and applications. Therefore, we provide a comprehensive survey that discusses emerging attacks on DRL-based systems and the potential countermeasures to defend against these attacks. Contribution we provide the DRL fundamentals along with a non-exhaustive taxonomy of advanced DRL algorithms; we present a comprehensive survey of adversarial attacks on DRL and their potential countermeasures; we discuss the available benchmarks and metrics for the robustness of DRL; finally, we highlight the open issues and research challenges in the robustness of DRL and introduce some potential research directions....

December 24, 2022 · 3 min · 517 words · Sukai Huang

Zuxin_liu on the Robustness of Safe Reinforcement Learning Under Observational Perturbations 2022

[TOC] Title: On the Robustness of Safe Reinforcement Learning Under Observational Perturbations Author: Zuxin Liu et al. Publish Year: 3 Oct 2022 Review Date: Thu, Dec 22, 2022 Summary of paper Motivation While many recent safe RL methods with deep policies can achieve outstanding constraint satisfaction in noise-free simulation environments, such a concern regarding their vulnerability under adversarial perturbation has not been studied in the safe RL setting. Contribution we are the first to formally analyze the unique vulnerability of the optimal policy in safe RL under observational corruptions....

December 22, 2022 · 3 min · 532 words · Sukai Huang

Ruben_majadas Disturbing Reinforcement Learning Agents With Corrupted Rewards 2021

[TOC] Title: Disturbing Reinforcement Learning Agents With Corrupted Rewards Author: Ruben Majadas et al. Publish Year: Feb 2021 Review Date: Sat, Dec 17, 2022 Summary of paper Motivation recent works have shown how the performance of RL algorithms decreases under the influence of soft changes in the reward function. However, little work has been done on how sensitive these algorithms are to such disturbances depending on the aggressiveness of the attack and the learning exploration strategy....

December 17, 2022 · 2 min · 383 words · Sukai Huang

Jingkang_wang Reinforcement Learning With Perturbed Rewards 2020

[TOC] Title: Reinforcement Learning With Perturbed Rewards Author: Jingkang Wang et al. Publish Year: 1 Feb 2020 Review Date: Fri, Dec 16, 2022 Summary of paper Motivation this paper studies RL with perturbed rewards, where a technical challenge is to revert the perturbation process so that the right policy is learned. Some experiments are used to support the algorithm (i.e., estimate the confusion matrix and revert the perturbation) using existing techniques from the supervised learning (and crowdsourcing) literature....
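
The reversion step can be sketched as follows, assuming a finite set of reward values and an estimated confusion matrix C with C[i, j] = P(observe value j | true value i); this is my illustration of the unbiased surrogate-reward idea, not the paper's code:

```python
# Surrogate rewards from an estimated confusion matrix: solving C @ r_hat = r makes the
# surrogate unbiased, i.e. E[r_hat(observed) | true value i] = reward_values[i].
import numpy as np

def surrogate_rewards(confusion, reward_values):
    """confusion: (k, k) row-stochastic matrix; reward_values: (k,) true reward values."""
    return np.linalg.solve(confusion, reward_values)

# usage: binary rewards {0, 1} flipped with probability 0.1 / 0.3
C = np.array([[0.9, 0.1],
              [0.3, 0.7]])
r_hat = surrogate_rewards(C, np.array([0.0, 1.0]))
# during training, replace an observed reward value j by r_hat[j]
```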

December 16, 2022 · 2 min · 402 words · Sukai Huang