[Image: the belief-desire-intention model]

Jacob_andreas Language Models as Agent Models 2022

[TOC] Title: Language Models as Agent Models Author: Jacob Andreas Publish Year: 3 Dec 2022 Review Date: Sat, Dec 10, 2022 https://arxiv.org/pdf/2212.01681.pdf Summary of paper Motivation During training, LMs have access only to the text of documents, with no direct evidence of the internal states of the human agents that produced them (akin to a hidden/partially observed MDP). This fact is often used to argue that LMs are incapable of modelling goal-directed aspects of human language production and comprehension....
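To make the framing concrete: the agent-model view treats text as generated by authors with latent beliefs and desires, which the LM must implicitly marginalise over. The formula below is my paraphrase of that setup, not an equation copied from the paper:

```latex
p(u) \;=\; \sum_{\beta,\,\delta} p(\beta,\delta)\, p(u \mid \beta,\delta)
% u: a document/utterance; \beta: the author's beliefs; \delta: the author's desires.
% The LM only ever fits the marginal p(u), so any model of (\beta, \delta) is implicit.
```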

December 10, 2022 · 3 min · 639 words · Sukai Huang
[Image: MEME agent network architecture]

Steven_kapturowski Human Level Atari 200x Faster 2022

[TOC] Title: Human Level Atari 200x Faster Author: Steven Kapturowski et al. DeepMind Publish Year: September 2022 Review Date: Wed, Oct 5, 2022 Summary of paper https://arxiv.org/pdf/2209.07550.pdf Motivation Agent 57 came at the cost of poor data-efficiency, requiring nearly 80,000 million frames of experience to achieve its performance; this agent achieves the same performance within 390 million frames Contribution Some key terms NFNet – Normalisation-Free Network https://towardsdatascience.com/nfnets-explained-deepminds-new-state-of-the-art-image-classifier-10430c8599ee Batch normalisation – the bad: it is expensive, and batch normalisation breaks the assumption of data independence NFNet applies 3 different techniques: modified residual branches and convolutions with Scaled Weight Standardisation; Adaptive Gradient Clipping; architecture optimisation for improved accuracy and training speed....
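As a concrete illustration of Adaptive Gradient Clipping: gradients are rescaled whenever their norm grows too large relative to the corresponding parameter norm. The sketch below is my own simplified PyTorch rendering (the paper applies the clip unit-wise, e.g. per row of a weight matrix; this version clips whole tensors for brevity):

```python
import torch

def adaptive_grad_clip(parameters, clip=0.01, eps=1e-3):
    """Adaptive Gradient Clipping (after Brock et al. 2021), tensor-level sketch.

    Rescales each gradient so that ||g|| / max(||w||, eps) <= clip.
    """
    for p in parameters:
        if p.grad is None:
            continue
        w_norm = p.detach().norm()
        g_norm = p.grad.detach().norm()
        max_norm = clip * torch.clamp(w_norm, min=eps)
        if g_norm > max_norm:
            p.grad.mul_(max_norm / (g_norm + 1e-6))
```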

October 5, 2022 · 2 min · 357 words · Sukai Huang
[Image: CoBERL architecture]

Andrea_banino Coberl Contrastive Bert for Reinforcement Learning 2022

[TOC] Title: CoBERL Contrastive BERT for Reinforcement Learning Author: Andrea Banino et al. DeepMind Publish Year: Feb 2022 Review Date: Wed, Oct 5, 2022 Summary of paper https://arxiv.org/pdf/2107.05431.pdf Motivation Contribution Some key terms Representation learning in reinforcement learning Motivation: if state information could be effectively extracted from raw observations, it may be possible to learn from pixels as fast as from states. However, given the often sparse reward signal coming from the environment, learning representations in RL has to be achieved with little to no supervision....
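For reference, here is a generic contrastive objective of the InfoNCE family, the broad class to which CoBERL's contrastive component belongs; this is my minimal sketch, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def info_nce(query, keys, temperature=0.1):
    """Generic InfoNCE contrastive loss.

    query, keys: (batch, dim) embeddings; keys[i] is the positive for query[i],
    and every other row in the batch serves as a negative.
    """
    query = F.normalize(query, dim=-1)
    keys = F.normalize(keys, dim=-1)
    logits = query @ keys.t() / temperature            # (batch, batch) similarities
    labels = torch.arange(query.size(0), device=query.device)
    return F.cross_entropy(logits, labels)
```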

October 5, 2022 · 2 min · 258 words · Sukai Huang

Alex_petrenko Sample Factory Asynchronous Rl at Very High Fps 2020

[TOC] Title: Sample Factory: Asynchronous RL at Very High FPS Author: Alex Petrenko Publish Year: Oct 2020 Review Date: Sun, Sep 25, 2022 Summary of paper Motivation Identifying performance bottlenecks RL involves three workloads: environment simulation, inference, and backpropagation; overall throughput is bottlenecked by the slowest workload. In existing methods (A2C/PPO/IMPALA) the computational workloads are interdependent, leading to under-utilisation of system resources. Existing high-throughput methods focus on distributed training, thereby introducing a lot of overhead such as networking, serialisation, etc....
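A toy sketch of the decoupling idea (my illustration, not Sample Factory's actual double-buffered implementation): simulation and inference run in separate processes connected by queues, so each workload waits only for the data it needs:

```python
import multiprocessing as mp

def rollout_worker(obs_q, act_q, steps=100):
    """Simulates the environment; never blocks on backpropagation."""
    obs = 0                         # stand-in for env.reset()
    for _ in range(steps):
        obs_q.put(obs)              # ship the observation to the policy worker
        action = act_q.get()        # wait only for the next action
        obs += action               # stand-in for env.step(action)

def policy_worker(obs_q, act_q, steps=100):
    """Runs inference concurrently with simulation (in the real system, a
    learner also consumes completed trajectories in parallel)."""
    for _ in range(steps):
        _obs = obs_q.get()
        act_q.put(1)                # stand-in for sampling from the policy

if __name__ == "__main__":
    obs_q, act_q = mp.Queue(), mp.Queue()
    worker = mp.Process(target=rollout_worker, args=(obs_q, act_q))
    worker.start()
    policy_worker(obs_q, act_q)
    worker.join()
```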

September 25, 2022 · 1 min · 154 words · Sukai Huang

Dongwon Fire Burns Sword Cuts Commonsense Inductive Bias for Exploration in Text Based Games 2022

[TOC] Title: Fire Burns, Sword Cuts: Commonsense Inductive Bias for Exploration in Text Based Games Author: Dongwon Kelvin Ryu et al. Publish Year: ACL 2022 Review Date: Thu, Sep 22, 2022 Summary of paper Motivation Text-based games (TGs) are exciting testbeds for developing deep reinforcement learning techniques due to their partially observed environments and large action spaces. A fundamental challenge in TGs is the efficient exploration of the large action space when the agent has not yet acquired enough knowledge about the environment....

September 22, 2022 · 2 min · 276 words · Sukai Huang

Younggyo_seo Masked World Models for Visual Control 2022

[TOC] Title: Masked World Models for Visual Control 2022 Author: Younggyo Seo et al. Publish Year: 2022 Review Date: Fri, Jul 1, 2022 https://arxiv.org/abs/2206.14244?context=cs.AI https://sites.google.com/view/mwm-rl Summary of paper Motivation TL;DR: Masked autoencoding (MAE) has emerged as a scalable and effective self-supervised learning technique. Can MAE also be effective for visual model-based RL? Yes, with the recipe of convolutional feature masking and reward prediction to capture fine-grained and task-relevant information. Some key terms Decouple visual representation learning and dynamics learning...
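A rough sketch of the recipe as I read it: mask convolutional features rather than raw pixels, reconstruct them with an autoencoder, and add a reward-prediction head. All module shapes below are illustrative, not the paper's:

```python
import torch
import torch.nn as nn

class MaskedFeatureWorldModel(nn.Module):
    """Sketch: mask conv features, reconstruct them, and predict reward."""

    def __init__(self, feat_dim=64, mask_ratio=0.75):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.conv = nn.Conv2d(3, feat_dim, kernel_size=8, stride=4)
        self.encoder = nn.Linear(feat_dim, feat_dim)
        self.decoder = nn.Linear(feat_dim, feat_dim)
        self.reward_head = nn.Linear(feat_dim, 1)

    def forward(self, pixels):
        feats = self.conv(pixels)                        # (B, C, H, W)
        B, C, H, W = feats.shape
        tokens = feats.flatten(2).transpose(1, 2)        # (B, H*W, C) feature tokens
        mask = torch.rand(B, H * W, 1, device=pixels.device) < self.mask_ratio
        z = self.encoder(tokens.masked_fill(mask, 0.0))  # encode only visible info
        recon = self.decoder(z)
        reward = self.reward_head(z.mean(dim=1))         # task-relevant signal
        recon_loss = ((recon - tokens) ** 2)[mask.expand_as(tokens)].mean()
        return recon_loss, reward
```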

July 1, 2022 · 2 min · 227 words · Sukai Huang

Hao_hu Generalisable Episodic Memory for Drl 2021

[TOC] Title: Generalisable Episodic Memory for Deep Reinforcement Learning Author: Hao Hu et al. Publish Year: Jun 2021 Review Date: April 2022 Summary of paper Motivation The authors proposed Generalisable Episodic Memory (GEM), which effectively organises the state-action values of episodic memory in a generalisable manner and supports implicit planning on memorised trajectories. Compared to a traditional memory table, GEM learns a virtual memory table, parameterised by deep neural networks, that aggregates similar state-action pairs which essentially share the same nature....
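A minimal sketch of the contrast being drawn, under my own naming: a tabular episodic memory keys on exact (s, a) pairs, while a "virtual" memory is a network that generalises across similar pairs:

```python
import torch
import torch.nn as nn

# A tabular episodic memory can only recall exact keys it has stored.
tabular_memory = {}                  # {(state_id, action_id): best_return}

class VirtualMemory(nn.Module):
    """A parametric stand-in for the table: nearby (s, a) pairs share values."""

    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))
```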

April 7, 2022 · 2 min · Sukai Huang

Ilya_kostrikov Offline Rl With Implicit Q Learning 2021

[TOC] Title: Offline Reinforcement Learning with Implicit Q-learning Author: Ilya Kostrikov et al. Publish Year: 2021 Review Date: Mar 2022 Summary of paper Motivation Conflict in offline reinforcement learning: offline reinforcement learning requires reconciling two conflicting aims: learning a policy that improves over the behaviour policy (the old policy that collected the dataset) while at the same time minimising deviation from the behaviour policy, so as to avoid errors due to distributional shift (e....
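IQL's central trick is expectile regression: V(s) is fitted to an upper expectile of Q(s, a), approximating the best in-dataset action without ever querying out-of-distribution actions. A minimal sketch of the loss, following the paper's L^tau_2(u) = |tau - 1(u < 0)| u^2 (the tau value below is illustrative):

```python
import torch

def expectile_loss(q_values, v_values, tau=0.7):
    """Fit V(s) to the tau-expectile of Q(s, a): errors where Q > V are
    weighted by tau, the rest by (1 - tau), so tau > 0.5 biases V upward
    toward the best actions actually present in the dataset."""
    diff = q_values - v_values
    weight = torch.abs(tau - (diff < 0).float())  # |tau - 1(u < 0)|
    return (weight * diff ** 2).mean()
```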

March 22, 2022 · 4 min · Sukai Huang

Qinqing_zheng Online Decision Transformer 2022

[TOC] Title: Online Decision Transformer Author: Qinqing Zheng Publish Year: Feb 2022 Review Date: Mar 2022 Summary of paper Motivation The authors proposed Online Decision Transformer (ODT), an RL algorithm based on sequence modelling that blends offline pretraining with online fine-tuning in a unified framework. ODT builds on the decision transformer architecture previously introduced for offline RL. Quantify exploration: compared to DT, they shifted from deterministic to stochastic policies for defining exploration objectives during the online phase....
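A sketch of what the deterministic-to-stochastic shift might look like at the action head (my illustration; layer sizes and naming are assumptions): the transformer's hidden state parameterises a Gaussian whose entropy serves as the exploration signal:

```python
import torch
import torch.nn as nn

class StochasticActionHead(nn.Module):
    """Decode a distribution over actions instead of a point estimate."""

    def __init__(self, hidden_dim, act_dim):
        super().__init__()
        self.mean = nn.Linear(hidden_dim, act_dim)
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, h):
        dist = torch.distributions.Normal(self.mean(h), self.log_std.exp())
        action = dist.rsample()                  # reparameterised sample
        entropy = dist.entropy().sum(-1).mean()  # exploration objective term
        return action, entropy
```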

March 21, 2022 · 4 min · Sukai Huang

Machel_reid Can Wikipedia Help Offline Rl 2022

[TOC] Title: Can Wikipedia Help Offline Reinforcement Learning Author: Machel Reid et al. Publish Year: Mar 2022 Review Date: Mar 2022 Summary of paper Motivation Fine-tuning reinforcement learning (RL) models has been challenging because of a lack of large-scale off-the-shelf datasets as well as high variance in transferability among different environments. Moreover, when the model is trained from scratch, it suffers from slow convergence. In this paper, they take advantage of the formulation of reinforcement learning as sequence modelling and investigate the transferability of sequence models pre-trained on other domains (vision, language) when fine-tuned on offline RL tasks (control, games)....

March 16, 2022 · 2 min · Sukai Huang

Wenfeng_feng Extracting Action Sequences From Texts by Rl

[TOC] Title: Extracting Action Sequences from Texts Based on Deep Reinforcement Learning Author: Wenfeng Feng et al. Publish Year: Mar 2018 Review Date: Mar 2022 Summary of paper Motivation The authors want to build a model that learns to directly extract action sequences without external tools like POS tagging and dependency parsing. Results… Annotation dataset structure example Model They exploit the framework to learn two models, predicting action names and arguments respectively....

March 15, 2022 · 1 min · Sukai Huang

Giuseppe_de_giacomo Foundations for Restraining Bolts Rl With Ltl 2019

[TOC] Title: Foundations for Restraining Bolts: Reinforcement Learning with LTLf/LDLf Restraining Specification Author: Giuseppe De Giacomo et al. Publish Year: 2019 Review Date: Mar 2022 Summary of paper The authors investigated the concept of a “restraining bolt” that can control the behaviour of learning agents. Essentially, the bolt controls an RL agent by providing additional rewards to it. Although this method is essentially the same as reward shaping (providing additional rewards to the agent), the contribution of this paper is....
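A toy rendering of the restraining-bolt idea (mine, not the paper's LTLf/LDLf machinery): a small automaton monitors the agent's event trace and pays extra reward as the specification progresses:

```python
class RestrainingBolt:
    """Monitors 'eventually pick_up, then eventually deliver' over a trace."""

    TRANSITIONS = {
        ("start", "pick_up"): "carrying",
        ("carrying", "deliver"): "done",
    }

    def __init__(self):
        self.state = "start"

    def reward(self, event):
        next_state = self.TRANSITIONS.get((self.state, event))
        if next_state is None:
            return 0.0                                  # no progress, no bolt reward
        self.state = next_state
        return 1.0 if next_state == "done" else 0.5     # shaped bolt reward

bolt = RestrainingBolt()
total = sum(bolt.reward(e) for e in ["move", "pick_up", "move", "deliver"])
print(total)  # 1.5
```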

March 4, 2022 · 2 min · Sukai Huang

Joseph_kim Collaborative Planning With Encoding of High Level Strategies 2017

[TOC] Title: Collaborative Planning with Encoding of Users’ High-level Strategies Author: Joseph Kim et al. Publish Year: 2017 Review Date: Mar 2022 Summary of paper Motivation Automatic planning is computationally expensive. Greedy search heuristics often yield low-quality plans that can result in wasted resources; also, even when an adequate plan is generated, users may have difficulty interpreting why it performs well and trusting it....

March 4, 2022 · 2 min · Sukai Huang

Mikayel_samvelyan Minihack the Planet a Sandbox for Open Ended Rl Research 2021

[TOC] Title: MiniHack the Planet: A Sandbox for Open-Ended Reinforcement Learning Research Author: Mikayel Samvelyan et al. Publish Year: Nov 2021 Review Date: Mar 2022 Summary of paper They presented MiniHack, an easy-to-use framework for creating rich and varied RL environments, as well as a suite of tasks developed using this framework. Built upon NLE and the des-file format, MiniHack enables the use of rich entities and dynamics from the game of NetHack to create a large variety of RL environments for targeted experimentation, while also allowing painless scaling-up of the difficulty of existing environments....
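A minimal usage sketch along the lines of MiniHack's documented gym interface (the task name is one of its registered environments; details may be version-dependent):

```python
import gym
import minihack  # noqa: F401  (importing registers the MiniHack-* environments)

env = gym.make("MiniHack-River-v0")
obs = env.reset()
obs, reward, done, info = env.step(env.action_space.sample())
```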

March 4, 2022 · 3 min · Sukai Huang

Richard_shin Constrained Language Models Yield Few Shot Semantic Parsers 2021

[TOC] Title: Constrained Language Models Yield Few-Shot Semantic Parsers Author: Richard Shin et al. Publish Year: Nov 2021 Review Date: Mar 2022 Summary of paper Motivation The authors wanted to explore the use of large pretrained language models as few-shot semantic parsers. However, language models are trained to generate natural language. To bridge the gap, they used language models to paraphrase inputs into a controlled sublanguage resembling English that can be automatically mapped to a target meaning representation....
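The general mechanism, constrained decoding, can be sketched as masking the LM's logits so that only tokens keeping the output inside the controlled sublanguage survive. This is my generic illustration, not the paper's parser:

```python
import torch

def constrained_next_token(logits, allowed_token_ids):
    """One constrained-decoding step: mask every token that would leave the
    controlled sublanguage, then pick among the survivors."""
    mask = torch.full_like(logits, float("-inf"))
    mask[allowed_token_ids] = 0.0
    return int(torch.argmax(logits + mask))

logits = torch.randn(50257)  # e.g., a GPT-2-sized vocabulary
print(constrained_next_token(logits, allowed_token_ids=[42, 1000, 2048]))
```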

March 2, 2022 · 1 min · Sukai Huang

Heinrich_kuttler The Nethack Learning Environment 2020

[TOC] Title: The NetHack Learning Environment Author: Heinrich Kuttler et al. Publish Year: Dec 2020 Review Date: Mar 2022 Summary of paper The NetHack Learning Environment (NLE) is a scalable, procedurally generated, stochastic, rich, and challenging environment for RL research based on the popular single-player terminal-based roguelike game NetHack. NetHack is sufficiently complex to drive long-term research on problems such as exploration, planning, skill acquisition, and language-conditioned RL, while dramatically reducing the computational resources required to gather a large amount of experience....
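A quickstart along the lines of the NLE README (task name from its docs; the API may vary across versions):

```python
import gym
import nle  # noqa: F401  (importing registers the NetHack* environments)

env = gym.make("NetHackScore-v0")
obs = env.reset()
obs, reward, done, info = env.step(env.action_space.sample())
env.render()
```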

March 2, 2022 · 3 min · Sukai Huang

Pashootan_vaezipoor Ltl2action Generalising Ltl Instructions for Multi Task Rl 2021

[TOC] Title: LTL2Action: Generalizing LTL Instructions for Multi-Task RL Author: Pashootan Vaezipoor et al. Publish Year: 2021 Review Date: March 2022 Summary of paper Motivation They addressed the problem of teaching a deep reinforcement learning agent to follow instructions in multi-task environments. Instructions are expressed in a well-known formal language – linear temporal logic (LTL). Limitation of the vanilla MDP: temporal constraints cannot be expressed as rewards in the MDP setting, and thus modular policies and similar approaches are not able to obtain maximum reward....
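One standard way to feed LTL instructions to an agent is formula progression: after each step, rewrite the outstanding formula given the propositions that just held. The sketch below is my toy illustration of that general idea, not LTL2Action's code:

```python
def progress(formula, true_props):
    """Progress a tiny LTL fragment one step: F(p) is satisfied once p holds."""
    op = formula[0]
    if op == "prop":
        return ("true",) if formula[1] in true_props else ("false",)
    if op == "eventually":                          # F(phi)  ==  phi or X F(phi)
        inner = progress(formula[1], true_props)
        return ("true",) if inner == ("true",) else formula
    return formula

goal = ("eventually", ("prop", "kitchen"))
goal = progress(goal, true_props={"hallway"})       # still pending
goal = progress(goal, true_props={"kitchen"})       # now satisfied
print(goal)  # ('true',)
```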

March 1, 2022 · 3 min · Sukai Huang

Roma_patel Learning to Ground Language Temporal Logical Form 2019

[TOC] Title: Learning to Ground Language to Temporal Logical Form Author: Roma Patel et al. Publish Year: 2019 Review Date: Feb 2022 Summary of paper Motivation Natural language commands often exhibit sequential (temporal) constraints, e.g., “go through the kitchen and then into the living room”. But such constraints cannot be expressed in the reward of a Markov Decision Process setting (see this paper). Therefore, they proposed to ground language to linear temporal logic (LTL) and then map from LTL expressions to action sequences....
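For instance, the quoted command is naturally captured with nested "eventually" operators; this is my rendering of the standard encoding, not necessarily the paper's exact operator set:

```latex
\mathbf{F}\big(\mathit{kitchen} \wedge \mathbf{F}\,\mathit{living\_room}\big)
% "eventually reach the kitchen, and strictly after that, eventually the living room"
```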

February 28, 2022 · 2 min · Sukai Huang

Tianshi_cao Babyai Plus Plus Towards Grounded Language Learning Beyond Memorization 2020

[TOC] Title: BABYAI++: Towards Grounded-Language Learning Beyond Memorization Author: Tianshi Cao et al. Publish Year: 2020 ICLR Review Date: Jan 2022 Summary of paper The paper introduced a new RL environment, BabyAI++, for investigating whether RL agents can extract knowledge from descriptive text and thereby increase generalisation performance. BabyAI++ environment example: the descriptive text describes the features of objects. Notice that an object’s features can easily change as we change the descriptive text....

January 3, 2022 · 1 min · Sukai Huang

Lili_chen Decision Transformer Reinforcement Learning via Sequence Modeling 2021

[TOC] Title: Decision Transformer: Reinforcement Learning via Sequence Modeling Author: Lili Chen et al. Publish Year: Jun 2021 Review Date: Dec 2021 Summary of paper The Architecture of Decision Transformer Inputs are return-to-go, observation and action. Outputs are actions; at training time, future actions are masked out. I believe this model is able to generate a very good long sequence of actions thanks to the transformer architecture. But somehow this is not RL anymore, because the transformer is not trained by a reward signal…...
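A sketch of the input interleaving described above (my illustration, not the official implementation): returns-to-go, states and actions are embedded, interleaved per timestep, and decoded with a causal mask:

```python
import torch
import torch.nn as nn

class DecisionTransformerSketch(nn.Module):
    """Interleave (return-to-go, state, action) tokens per timestep and decode
    actions with a causally masked transformer. Sizes are illustrative."""

    def __init__(self, state_dim, act_dim, d_model=128):
        super().__init__()
        self.embed_rtg = nn.Linear(1, d_model)
        self.embed_state = nn.Linear(state_dim, d_model)
        self.embed_action = nn.Linear(act_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        self.predict_action = nn.Linear(d_model, act_dim)

    def forward(self, rtg, states, actions):
        # rtg: (B, T, 1), states: (B, T, state_dim), actions: (B, T, act_dim)
        B, T, _ = states.shape
        tokens = torch.stack(
            [self.embed_rtg(rtg), self.embed_state(states), self.embed_action(actions)],
            dim=2,
        ).reshape(B, 3 * T, -1)                    # ..., R_t, s_t, a_t, ...
        causal = nn.Transformer.generate_square_subsequent_mask(3 * T)
        h = self.transformer(tokens, mask=causal)  # future tokens masked out
        return self.predict_action(h[:, 1::3])     # predict a_t from the s_t slot
```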

December 24, 2021 · 2 min · Sukai Huang

Yiding_jiang Language as Abstraction for Hierarchical Deep Reinforcement Learning

[TOC] Title: Language as an Abstraction for Hierarchical Deep Reinforcement Learning Author: Yiding Jiang et al. Publish Year: 2019 NeurIPS Review Date: Dec 2021 Summary of paper Solving complex, temporally-extended tasks is a long-standing problem in RL. Acquiring effective yet general abstractions for hierarchical RL is remarkably challenging. Therefore, they propose to use language as the abstraction, as it provides unique compositional structure, enabling fast learning and combinatorial generalisation. They present their framework for training a 2-layer hierarchical policy with compositional language as the abstraction between the high-level policy and low-level policy....
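Schematically (my sketch of the idea, not the paper's code), the high level emits a language instruction and the goal-conditioned low level acts until the instruction is satisfied:

```python
# Language sits between the two policy layers. Stubs stand in for learned
# networks and for the environment.
def high_level_policy(state):
    return "move the red ball to the blue square"   # language as the abstraction

def low_level_policy(state, instruction):
    return 0                                        # stand-in for a conditioned action

def run_episode(env_step, state, horizon=9):
    instruction = high_level_policy(state)
    for _ in range(horizon):
        state, instruction_done = env_step(state, low_level_policy(state, instruction))
        if instruction_done:                        # high level picks the next sub-goal
            instruction = high_level_policy(state)
    return state

def toy_env_step(state, action):
    state += 1
    return state, state % 3 == 0                    # "instruction done" every 3 steps

print(run_episode(toy_env_step, state=0))           # -> 9
```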

December 15, 2021 · 3 min · Sukai Huang

David_abel On the Expressivity of Markov Reward 2021

[TOC] Title: On the Expressivity of Markov Reward Author: David Abel et al. Publish Year: NeurIPS 2021 Review Date: 6 Dec 2021 Summary of paper The authors found that in the Markov Decision Process scenario (i.e., where we do not look at the history of the trajectory to provide rewards), some tasks cannot be realised perfectly by reward functions....

December 5, 2021 · 5 min · Sukai Huang

Rishabh_agarwal Deep Reinforcement Learning at the Edge of the Stats Precipice 2021

[TOC] Title: Deep Reinforcement Learning at the Edge of the Statistical Precipice Author: Rishabh Agarwal et al. Publish Year: NeurIPS 2021 Review Date: 3 Dec 2021 Summary of paper Most currently published results on deep RL benchmarks use point estimates of aggregate performance, such as the mean and median score across tasks....
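One of the paper's recommended alternatives is the interquartile mean (IQM): the mean of the middle 50% of runs, more robust than the mean and less wasteful than the median. A quick sketch with made-up scores:

```python
import numpy as np
from scipy.stats import trim_mean

# Per-run normalised scores (illustrative numbers; note the outlier at 3.0).
scores = np.array([0.1, 0.4, 0.5, 0.55, 0.6, 0.62, 0.7, 3.0])
iqm = trim_mean(scores, proportiontocut=0.25)  # drop bottom & top 25%, average the rest
print(f"mean={scores.mean():.2f}  median={np.median(scores):.2f}  IQM={iqm:.2f}")
```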

December 3, 2021 · 3 min · Sukai Huang

Borja_ibarz Reward Learning From Human Preferences and Demonstrations in Atari 2018

[TOC] Title: Reward Learning from Human Preferences and Demonstrations in Atari Author: Borja Ibarz et al. Publish Year: 2018 Review Date: Nov 2021 Summary of paper The authors proposed a method that uses human experts’ annotations rather than the extrinsic reward from the environment to guide reinforcement learning....
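The preference part of this line of work typically fits the reward model with a Bradley-Terry style loss: the probability that a human prefers clip A is a softmax over the two clips' predicted reward sums. A minimal sketch (tensor shapes are my assumption):

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_sum_a, reward_sum_b, human_prefers_a):
    """Bradley-Terry style objective: P(A preferred) = softmax over the two
    clips' predicted reward sums; train with cross-entropy on the labels."""
    logits = torch.stack([reward_sum_a, reward_sum_b], dim=-1)  # (batch, 2)
    labels = (~human_prefers_a).long()   # 0 -> clip A preferred, 1 -> clip B
    return F.cross_entropy(logits, labels)

# Toy usage: predicted reward sums for two clips across a batch of comparisons.
ra = torch.tensor([2.0, 0.5])
rb = torch.tensor([1.0, 1.5])
prefer_a = torch.tensor([True, False])
print(preference_loss(ra, rb, prefer_a))
```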

November 27, 2021 · 2 min · Sukai Huang

Adrien_ecoffet Go Explore a New Approach for Hard Exploration Problems 2021 Paper Review

[TOC] Title: Go-Explore: a New Approach for Hard-Exploration Problems Author: Adrien Ecoffet et al. Publish Year: 2021 Review Date: Nov 2021 Summary of paper The authors hypothesised that there are two main issues preventing DRL agents from achieving high scores in hard-exploration games (e....

November 27, 2021 · 4 min · Sukai Huang