Ying_shen Learning by Asking for Embodied Visual Navigation and Task Completion 2023

Title: Learning by Asking for Embodied Visual Navigation and Task Completion Author: Ying Shen et al. Publish Year: 9 Feb 2023 Review Date: Thu, Mar 2, 2023 url: https://arxiv.org/pdf/2302.04865.pdf Summary of paper Motivation: despite recent progress on related vision-language benchmarks, most prior work has focused on building agents that follow instructions rather than endowing agents with the ability to ask questions to actively resolve ambiguities arising naturally in embodied environments. Contribution ...

March 2, 2023 · 2 min · 411 words · Sukai Huang

Ernest_davis Benchmarks for Automated Commonsense Reasoning a Survey 2023

Title: Benchmarks for Automated Commonsense Reasoning: A Survey Author: Ernest Davis Publish Year: 9 Feb 2023 Review Date: Thu, Mar 2, 2023 url: https://arxiv.org/pdf/2302.04752.pdf

Summary of paper: we mainly focus on the section where the author discusses the general features of commonsense reasoning.

Terms: clarify what we mean by common sense. What exactly is “commonsensical”?

Claims about common sense that seem true to the author:
Commonsense knowledge is common. In talking to another person, we do not have to explain commonsense reasoning or enumerate commonsense facts. We can assume that they know that unsupported things fall down, that in temperate regions outside the tropics summer days are generally warmer than winter days, and so on.
Common sense is largely sensible. Any individual person or even an entire society may have various foolish or mistaken beliefs, but for the most part commonsense knowledge corresponds to the realities of the world as people experience it.
Common sense supports reasoning. For example, a person who knows that Central Park is in New York, that the Golden Gate Bridge is in San Francisco, and that New York and San Francisco are 3000 miles apart will realize that they cannot walk from one to the other in fifteen minutes.
Commonsense reasoning is integrated with other cognitive abilities.
Common sense extends across tasks and modalities.
Common sense is broad in scope.
Commonsense knowledge can be distinguished from common knowledge, encyclopaedic knowledge and expert knowledge.

Half-truths about commonsense knowledge:
Commonsense knowledge is language-independent. The English-language bias is as pervasive in commonsense reasoning as in other areas of AI. Impressively, versions of ConceptNet with at least 10,000 concepts exist in 83 different languages, and a few commonsense benchmarks have been translated (table 4), but most resources and benchmarks exist only in English or in a symbolic form in which the symbols are in fact English words or short phrases.
Commonsense knowledge is the same for people of different cultures and of different historical periods. Even if a belief has been commonsense knowledge for everyone at all times up to the present, that does not mean that this will continue in the future.
Commonsense reasoning is fast and intuitive; it falls within “System 1”. Processes in System 1 characteristically are executed quickly, do not require conscious thought, are not open to introspection, at least in some cases are not controllable (one cannot decide not to interpret what one is seeing), and do not place a cognitive burden on working memory; vision is a paradigmatic example. Processes in System 2 are the reverse: slow, consciously carried out, consciously controllable, introspectable, and taxing on working memory. System 2 processes can call on System 1 but not vice versa, since a fast process cannot use a slow subroutine. However, encyclopaedic and expert knowledge can also be called on in System 1 activities.
Commonsense knowledge can be expressed using simple language. This seems plausible: basic vocabulary tends to refer to the well-known concepts and relations which are the subject of commonsense knowledge. However, there is a very large exception: commonsense spatial knowledge. Natural language is notoriously ill-suited to describing characteristics of shapes and positions that are easily apprehended (poor expressivity of natural language).

An untrue claim about commonsense knowledge:
Commonsense knowledge is not logically complex. However, in physical reasoning, understanding the physical characteristics can be quite complex (e.g., in Angry Birds), yet humans are good at playing Angry Birds.

March 2, 2023 · 3 min · 573 words · Sukai Huang

Alexander_nikulin Anti Exploration by Random Network Distillation 2023

Title: Anti Exploration by Random Network Distillation Author: Alexander Nikulin et al. Publish Year: 31 Jan 2023 Review Date: Wed, Mar 1, 2023 url: https://arxiv.org/pdf/2301.13616.pdf Summary of paper Motivation: despite the success of Random Network Distillation (RND) in various domains, it was shown to be not discriminative enough to be used as an uncertainty estimator for penalizing out-of-distribution actions in offline reinforcement learning. (Reviewer's question: why do we want to penalize out-of-distribution actions?) Contribution: with a naive choice of conditioning for the RND prior, it becomes infeasible for the actor to effectively minimize the anti-exploration bonus, and discriminativity is not an issue. We show that this limitation can be avoided with conditioning based on Feature-wise Linear Modulation (FiLM), resulting in a simple and efficient ensemble-free algorithm based on Soft Actor-Critic. Some key terms: why we want uncertainty-based penalization ...

March 1, 2023 · 2 min · 359 words · Sukai Huang
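
For the Anti Exploration by Random Network Distillation entry above, the following is a minimal sketch of how an RND-style anti-exploration bonus with FiLM conditioning could look. It is not the authors' implementation: the bias-free linear layers, the dimensions, and the choice to apply FiLM in both the prior and the predictor are assumptions made for illustration. The bonus is the squared error between a trainable predictor and a frozen random prior; state-action pairs unlike those the predictor was trained on should yield a large error, which can then be subtracted from the value estimate.

```python
import random


def make_linear(in_dim, out_dim, rng):
    """Random weight matrix (list of rows) for a bias-free linear map."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(in_dim)] for _ in range(out_dim)]


def linear(weights, x):
    """Apply the linear map to vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def film(state_features, gamma, beta):
    """FiLM: scale and shift each state feature with action-derived gamma and beta."""
    return [g * h + b for g, h, b in zip(gamma, state_features, beta)]


def init_network(rng, state_dim=3, action_dim=2, hidden=4):
    """Hypothetical tiny network: one state layer plus two FiLM generators."""
    return {
        "state": make_linear(state_dim, hidden, rng),
        "gamma": make_linear(action_dim, hidden, rng),
        "beta": make_linear(action_dim, hidden, rng),
    }


def embed(params, state, action):
    """State features modulated by the action through FiLM."""
    h = linear(params["state"], state)
    return film(h, linear(params["gamma"], action), linear(params["beta"], action))


def anti_exploration_bonus(predictor, prior, state, action):
    """Squared error between the predictor and the frozen random prior."""
    p = embed(predictor, state, action)
    q = embed(prior, state, action)
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q))


rng = random.Random(0)
prior = init_network(rng)      # frozen random target network
predictor = init_network(rng)  # would be trained on dataset (state, action) pairs
state, action = [0.1, -0.2, 0.3], [0.5, -0.5]
untrained_bonus = anti_exploration_bonus(predictor, prior, state, action)
matched_bonus = anti_exploration_bonus(prior, prior, state, action)
```

In this sketch, a predictor that matches the prior exactly gives a bonus of zero, which is the situation training aims for on in-dataset pairs.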

Edoardo_cetin Learning Pessimism for Reinforcement Learning 2023

Title: Learning Pessimism for Reinforcement Learning Author: Edoardo Cetin et al. Publish Year: 2023 Review Date: Wed, Mar 1, 2023 url: https://kclpure.kcl.ac.uk/portal/files/196848783/10977.CetinE.pdf Summary of paper Motivation: off-policy deep RL algorithms commonly compensate for overestimation bias during temporal difference learning by utilizing pessimistic estimates of the expected target returns. Contribution: we propose Generalised Pessimism Learning (GPL), a strategy employing a novel learnable penalty to enact such pessimism. In particular, we propose to learn this penalty alongside the critic with dual TD-learning, a new procedure to estimate and minimise the magnitude of the target returns bias with trivial computational cost. Some key terms: we attribute recent improvements in RL algorithms to two main linked advances: ...

March 1, 2023 · 2 min · 222 words · Sukai Huang
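
For the Learning Pessimism entry above, a small sketch may help show what a "pessimistic estimate" with an adjustable penalty means. It relies on the identity min(a, b) = (a + b)/2 - |a - b|/2, so the familiar clipped double-Q target is the special case of a disagreement penalty with coefficient 0.5. The function names and the absolute-difference penalty are illustrative assumptions; the exact penalty and the dual TD-learning update used by GPL are not reproduced here.

```python
def clipped_double_q(q1, q2):
    """Standard pessimistic target: the smaller of two critic estimates."""
    return min(q1, q2)


def penalised_target(q1, q2, beta):
    """Mean of two critic estimates minus beta times their disagreement.

    beta is the pessimism coefficient; a method that learns its pessimism
    would adjust beta during training instead of fixing it (assumption:
    illustrative form only, not the paper's exact penalty).
    """
    mean = 0.5 * (q1 + q2)
    return mean - beta * abs(q1 - q2)


q1, q2 = 3.0, 5.0
no_pessimism = penalised_target(q1, q2, beta=0.0)    # plain average
same_as_min = penalised_target(q1, q2, beta=0.5)     # recovers clipped double-Q
more_pessimism = penalised_target(q1, q2, beta=1.0)  # below both estimates
```

Treating the coefficient as a continuous quantity is what makes it possible to tune the amount of pessimism rather than commit to the fixed minimum.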

Timo_schick Toolformer Language Models Can Teach Themselves to Use Tools 2023

Title: Toolformer: Language Models Can Teach Themselves to Use Tools 2023 Author: Timo Schick et al., Meta AI Research Publish Year: 9 Feb 2023 Review Date: Wed, Mar 1, 2023 url: https://arxiv.org/pdf/2302.04761.pdf Summary of paper Motivation: LMs exhibit remarkable abilities to solve new tasks from just a few examples or textual instructions, especially at scale. Yet they also struggle with basic functionality, such as arithmetic or factual lookup. Contribution: in this paper, we show that LMs can teach themselves to use external tools via simple APIs and achieve the best of both worlds. We introduce Toolformer, a model that incorporates a range of tools, including a calculator, a Q&A system, a search engine, a translation system and a calendar. Some key terms: limitation of language models ...

March 1, 2023 · 3 min · 486 words · Sukai Huang
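
For the Toolformer entry above, the idea of calling a tool through a simple API embedded in text can be sketched as follows. The inline marker syntax `[Calculator(expression)]`, the two-decimal rounding, and the helper names are assumptions for illustration and are not taken from the excerpt. The calculator walks the parsed expression tree and supports only `+`, `-`, `*` and `/` on numeric literals, so arbitrary code in the text is never evaluated.

```python
import ast
import operator
import re

# Supported binary operators for the toy calculator tool.
OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}

# Hypothetical inline call marker, e.g. "[Calculator(400/1400)]".
CALL = re.compile(r"\[Calculator\((.*?)\)\]")


def calculator(expression):
    """Evaluate a basic arithmetic expression without using eval()."""
    def evaluate(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](evaluate(node.left), evaluate(node.right))
        raise ValueError("unsupported expression")
    return evaluate(ast.parse(expression, mode="eval").body)


def execute_calls(text):
    """Replace every calculator marker with the marker plus its result."""
    def run(match):
        expr = match.group(1)
        return f"[Calculator({expr}) -> {calculator(expr):.2f}]"
    return CALL.sub(run, text)


annotated = execute_calls("400 out of 1400 is [Calculator(400/1400)] of the total.")
```

The language model only has to emit the marker; the arithmetic it tends to get wrong is delegated to the tool.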

Almog_gueta Knowledge Is a Region in Weight Space for Fine Tuned Language Model 2023

Title: Knowledge Is a Region in Weight Space for Fine Tuned Language Model Author: Almog Gueta et al. Publish Year: 12 Feb 2023 Review Date: Wed, Mar 1, 2023 url: https://arxiv.org/pdf/2302.04863.pdf Summary of paper Motivation: relatively little is known about the relationships between different models, especially those trained or tested on different datasets. Contribution: we demonstrate that fine-tuned models that were optimized for high performance reside in well-defined regions in weight space, and vice versa. Language models that have been fine-tuned on the same dataset form a tight cluster in the same weight space, and models fine-tuned on different datasets from the same underlying task form a looser cluster. Traversing around the region between the models reaches new models that perform comparably or even better than models found via fine-tuning. Our findings demonstrate that a model positioned between two similar models can acquire the knowledge of both. We leverage this finding and design a method to pick a better model for efficient fine-tuning. More findings ...

March 1, 2023 · 3 min · 548 words · Sukai Huang
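
For the Knowledge Is a Region in Weight Space entry above, "a model positioned between two similar models" can be read as a point on the line segment joining their weights. The sketch below shows plain linear interpolation of two parameter dictionaries; representing parameters as flat lists of floats, and the name `interpolate`, are assumptions for illustration, not the paper's code.

```python
def interpolate(weights_a, weights_b, alpha):
    """Return (1 - alpha) * A + alpha * B, parameter by parameter.

    Both models must share parameter names and sizes, which holds when
    they were fine-tuned from the same pre-trained checkpoint.
    """
    if weights_a.keys() != weights_b.keys():
        raise ValueError("models have different parameter names")
    merged = {}
    for name in weights_a:
        a, b = weights_a[name], weights_b[name]
        if len(a) != len(b):
            raise ValueError(f"size mismatch for parameter {name!r}")
        merged[name] = [(1 - alpha) * x + alpha * y for x, y in zip(a, b)]
    return merged


# Two hypothetical fine-tuned models with a single parameter vector each.
model_a = {"w": [0.0, 2.0]}
model_b = {"w": [2.0, 4.0]}
midpoint = interpolate(model_a, model_b, alpha=0.5)
```

Sweeping `alpha` from 0 to 1 and evaluating each interpolated model is one simple way to probe the region between two fine-tuned models.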

Xiwen_liang Contrastive Instruction Trajectory Learning for Vision Language Navigation 2022

Title: Contrastive Instruction Trajectory Learning for Vision Language Navigation Author: Xiwen Liang et al. Publish Year: AAAI 2022 Review Date: Fri, Feb 10, 2023 url: https://arxiv.org/abs/2112.04138 Summary of paper Motivation: previous works learn to navigate step-by-step following an instruction. However, these works may fail to discriminate the similarities and discrepancies across instruction-trajectory pairs and ignore the temporal continuity of sub-instructions. These problems hinder agents from learning distinctive vision-and-language representations. Contribution: we propose a coarse-grained contrastive learning objective to enhance vision-and-language representations by contrasting the semantics of full trajectory observations and instructions respectively; a fine-grained contrastive learning objective to perceive instructions by leveraging the temporal information of the sub-instructions; and a pairwise sample-reweighting mechanism to mitigate sampling bias in contrastive learning. Some key terms: limitation of current VLN models ...

February 10, 2023 · 2 min · 360 words · Sukai Huang
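
For the Contrastive Instruction Trajectory Learning entry above, the general shape of a contrastive objective can be sketched with a generic InfoNCE loss. This is a standard formulation chosen for illustration, not necessarily the exact loss in the paper; the similarity scores and the temperature value are made-up inputs. One matching instruction-trajectory pair (the positive) is contrasted against the non-matching pairs in the same list.

```python
import math


def info_nce(similarities, positive_index, temperature=0.1):
    """Negative log-softmax of the positive pair's similarity score.

    similarities: scores between one anchor and every candidate, where
    exactly one candidate (positive_index) is the matching pair.
    """
    scaled = [s / temperature for s in similarities]
    peak = max(scaled)  # subtract the maximum for numerical stability
    log_denominator = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return log_denominator - scaled[positive_index]


# Hypothetical similarity scores: index 0 is the matching trajectory.
well_separated = info_nce([0.9, 0.1, 0.2], positive_index=0)
poorly_separated = info_nce([0.3, 0.4, 0.5], positive_index=0)
```

The loss is small when the positive pair scores well above the negatives and grows as the negatives catch up, which is the pressure that pushes representations of different instruction-trajectory pairs apart.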

Jacob_andreas Lammp Language Models as Probabilistic Priors for Perception and Action 2023

Title: LaMPP: Language Models as Probabilistic Priors for Perception and Action 2023 Author: Belinda Z. Li, Jacob Andreas et al. Publish Year: 3 Feb 2023 Review Date: Fri, Feb 10, 2023 url: https://arxiv.org/pdf/2302.02801.pdf Summary of paper Motivation: language models trained on large text corpora encode rich distributional information about real-world environments and action sequences. This information plays a crucial role. Contribution: we describe how to leverage language models for non-linguistic perception and control tasks. Our approach casts labelling and decision-making as inference in probabilistic graphical models in which language models parameterize prior distributions over labels, decisions and parameters, making it possible to integrate uncertain observations and incomplete background knowledge in a principled way. Some key terms: common-sense priors ...

February 10, 2023 · 2 min · 267 words · Sukai Huang
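
For the Language Models as Probabilistic Priors entry above, the simplest instance of "a language model parameterizes the prior" is Bayes' rule over a handful of labels: the posterior is proportional to the prior times the likelihood of the observation. The labels and all probabilities below are invented for illustration; in practice the prior would come from language model scores and the likelihood from a perception model.

```python
def posterior(prior, likelihood):
    """Combine a prior over labels with observation likelihoods via Bayes' rule."""
    unnormalised = {label: prior[label] * likelihood[label] for label in prior}
    total = sum(unnormalised.values())
    if total == 0:
        raise ValueError("prior and likelihood have no overlapping support")
    return {label: value / total for label, value in unnormalised.items()}


# Hypothetical numbers: the object is in a bathroom, so the language-model
# prior favours "sink", while the noisy vision model slightly favours "sofa".
lm_prior = {"sink": 0.8, "sofa": 0.2}
vision_likelihood = {"sink": 0.4, "sofa": 0.6}
belief = posterior(lm_prior, vision_likelihood)
```

Here the background knowledge in the prior overrides a weakly confident observation, which is the kind of correction the probabilistic framing is meant to allow.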

Zhuosheng_zhang Multimodal Chain of Thought Reasoning in Language Models 2023

Title: Multimodal Chain of Thought Reasoning in Language Models Author: Zhuosheng Zhang et al. Publish Year: 2023 Review Date: Wed, Feb 8, 2023 url: https://arxiv.org/pdf/2302.00923.pdf Summary of paper Motivation: LLMs have shown impressive performance on complex reasoning by leveraging chain-of-thought (CoT) prompting to generate intermediate reasoning chains as the rationale to infer the answer. To elicit CoT reasoning in multimodality, a possible solution is to fine-tune small language models by fusing the vision and language features to perform CoT reasoning. The key challenge is that those language models tend to generate hallucinated reasoning chains that mislead the answer inference. Contribution: we propose Multimodal-CoT, which incorporates vision features in a decoupled training framework. The framework separates rationale generation and answer inference into two stages, so that the model is able to generate effective rationales that contribute to answer inference. Some key terms: Multimodal-CoT ...

February 8, 2023 · 3 min · 548 words · Sukai Huang

Siyuan_wang Unifying Structure Reasoning and Language Model Pre Training for Complex Reasoning 2023

Title: Unifying Structure Reasoning and Language Model Pre Training for Complex Reasoning Author: Siyuan Wang et al. Publish Year: 21 Jan 2023 Review Date: Wed, Feb 8, 2023 url: https://arxiv.org/pdf/2301.08913.pdf Summary of paper Motivation: language models still suffer from a heterogeneous information alignment problem and a noisy knowledge injection problem. For complex reasoning, the context contains rich knowledge that typically exists in complex and sparse form. Contribution: we propose to unify structure reasoning and language model pre-training. The method identifies four types of elementary knowledge structures from contexts to construct structured queries, and utilises a box embedding method to conduct explicit structure reasoning along the queries during language modeling. Some key terms: what is the problem ...

February 8, 2023 · 2 min · 281 words · Sukai Huang

Ekin_akyurek Towards Tracing Factual Knowledge in Language Models Back to the Training Data 2022

Title: Towards Tracing Factual Knowledge in Language Models Back to the Training Data Author: Ekin Akyurek et al. Publish Year: EMNLP 2022 Review Date: Wed, Feb 8, 2023 url: https://aclanthology.org/2022.findings-emnlp.180.pdf Summary of paper Motivation: LMs have been shown to memorize a great deal of factual knowledge contained in their training data. But when an LM generates an assertion, it is often difficult to determine where it learned this information and whether it is true. Contribution: we propose the problem of fact tracing: identifying which training examples taught an LM to generate a particular factual assertion. Prior work on training data attribution (TDA) may offer effective tools for identifying such examples, known as “proponents”. We present the first quantitative benchmark to evaluate this. We compare two popular families of TDA methods: gradient-based and embedding-based. Some key terms: training data attribution (TDA) method ...

February 8, 2023 · 2 min · 363 words · Sukai Huang

Danijar_hafner Mastering Diverse Domains Through World Models 2023

Title: Mastering Diverse Domains Through World Models Author: Danijar Hafner et al. Publish Year: 10 Jan 2023 Review Date: Tue, Feb 7, 2023 url: https://www.youtube.com/watch?v=vfpZu0R1s1Y Summary of paper Motivation: general intelligence requires solving tasks across many domains. Current reinforcement learning algorithms carry this potential but are held back by the resources and knowledge required to tune them for new tasks. Contribution: we present DreamerV3, a general and scalable algorithm based on world models that outperforms previous approaches across a wide range of domains with fixed hyperparameters. We observe favourable scaling properties of DreamerV3, with larger models directly translating to higher data-efficiency and final performance. Some key terms: world model learning ...

February 7, 2023 · 2 min · 291 words · Sukai Huang

Yuanhan_zhang What Makes Good Examples for Visual in Context Learning 2023

Title: What Makes Good Examples for Visual in Context Learning Author: Yuanhan Zhang et al. Publish Year: 1 Feb 2023 Review Date: Mon, Feb 6, 2023 url: https://arxiv.org/pdf/2301.13670.pdf Summary of paper Motivation: in this paper, the main focus is on an emergent ability in large vision models, known as in-context learning. This concept has been well known in natural language processing but has only been studied very recently for large vision models. Contribution: for the first time, we provide a comprehensive investigation of the impact of in-context examples in computer vision, and find that performance is highly sensitive to the choice of in-context examples, exposing a critical issue: different in-context examples can lead to drastically different results. Our methods obtain significant improvements over random selection under various problem settings, showing the potential of using prompt retrieval in vision applications with a Model-as-a-Service (MaaS) business structure. We show that a good in-context example should be semantically similar to the query and closer in context. A model that can better balance spatial and semantic closeness in feature space would be more ideal for visual in-context learning. (Reviewer's note: this is because the model is not smart enough to directly tell the semantics regardless of what the spatial structure looks like.) Some key terms: existing issue of using LLMs ...

February 6, 2023 · 3 min · 427 words · Sukai Huang

Jing_yu_koh Grounding Language Models to Images for Multimodal Generation 2023

Title: Grounding Language Models to Images for Multimodal Generation Author: Jing Yu Koh et al. Publish Year: 31 Jan 2023 Review Date: Mon, Feb 6, 2023 url: https://arxiv.org/pdf/2301.13823.pdf Summary of paper Motivation: we propose an efficient method to ground pre-trained text-only language models to the visual domain. How: we keep the language model frozen, and finetune input and output linear layers to enable cross-modality interactions. This allows our model to process arbitrarily interleaved ... Contribution: our approach works with any off-the-shelf language model and paves the way towards an effective, general solution for leveraging pre-trained language models in visually grounded settings. Related work: LLMs for vision-and-language ...

February 6, 2023 · 2 min · 239 words · Sukai Huang

Zhenfang_chen See Think Confirm Interactive Prompting Between Vision and Language Models for Knowledge Based Visual Reasoning 2023

Title: See, Think, Confirm: Interactive Prompting Between Vision and Language Models for Knowledge Based Visual Reasoning Author: Zhenfang Chen et al. Publish Year: 12 Jan 2023 Review Date: Mon, Feb 6, 2023 url: https://arxiv.org/pdf/2301.05226.pdf Summary of paper Motivation: solving knowledge-based visual reasoning tasks remains challenging, as it requires a model to comprehensively understand image content, connect external world knowledge, and perform step-by-step reasoning to answer the questions correctly. Contribution: we propose a novel framework named Interactive Prompting Visual Reasoner (IPVR) for few-shot knowledge-based visual reasoning. IPVR contains three stages: see, think, and confirm. The see stage scans the image and grounds the visual concept candidates with a visual perception model. The think stage adopts a pre-trained large language model (LLM) to attend to the key concepts from the candidates adaptively. It then transforms them into text context for prompting with a visual captioning model and adopts the LLM to generate the answer. The confirm stage further uses the LLM to generate the supporting rationale for the answer, verifies the generated rationale with a cross-modality classifier, and ensures that the rationale can infer the predicted output consistently. Some key terms: human process to handle knowledge-based visual reasoning ...

February 6, 2023 · 2 min · 405 words · Sukai Huang

Xiaotian_liu a Planning Based Neural Symbolic Approach for Embodied Instruction Following 2022

Title: A Planning Based Neural Symbolic Approach for Embodied Instruction Following Author: Xiaotian Liu et al. Publish Year: 2022 Review Date: Thu, Feb 2, 2023 url: https://embodied-ai.org/papers/2022/15.pdf Summary of paper Motivation: end-to-end deep learning methods struggle at these tasks due to long horizons and sparse rewards. Contribution: our main innovation relies on combining DL models for perception and NLP with a new egocentric planner based on successive planning problems formulated using the PDDL syntax, both for exploration and task accomplishment. Our planning framework can naturally recover from action failures at any stage of the planned trajectory. Some key terms: Embodied Instruction Following ...

February 2, 2023 · 2 min · 226 words · Sukai Huang

So_yeon_min Film Following Instructions in Language With Modular Methods 2022

Title: FILM: Following Instructions in Language With Modular Methods Author: So Yeon Min et al. Publish Year: 16 Mar 2022 Review Date: Wed, Feb 1, 2023 url: https://arxiv.org/pdf/2110.07342.pdf Summary of paper Motivation: current approaches assume that neural states will integrate multimodal semantics to perform state tracking, building spatial memory, exploration, and long-term planning. In contrast, we propose a modular method with structured representations that builds a semantic map of the scene and performs exploration with a semantic search policy to achieve the natural language goal. Contribution: FILM consists of several modular components that (1) process language instructions into structured forms (Language Processing); (2) convert egocentric visual input into a semantic metric map (Semantic Mapping); (3) predict a search goal location (Semantic Search Policy) (reviewer's note: the subgoal will be plotted as a dot on the semantic top-down map); and (4) output subsequent navigation/interaction actions (Deterministic Policy). Some key terms: embodied instruction following ...

February 1, 2023 · 3 min · 430 words · Sukai Huang

Yuki_inoue Prompter Utilizing Large Language Model Prompting for a Data Efficient Embodied Instruction Following 2022

Title: Prompter: Utilizing Large Language Model Prompting for a Data Efficient Embodied Instruction Following Author: Yuki Inoue et al. Publish Year: 7 Nov 2022 Review Date: Wed, Feb 1, 2023 url: https://arxiv.org/pdf/2211.03267.pdf Summary of paper Motivation: we propose FILM++, which extends the existing work FILM with modifications that do not require extra data. Furthermore, we propose Prompter, which replaces FILM++’s semantic search module with language model prompting. No training is needed for our prompting-based implementation, while it achieves better or at least comparable performance. Contribution: FILM++ fills the role of the data-efficient baseline. We propose Prompter, which replaces the semantic search module of FILM++ with language prompting, making it even more data efficient. Some key terms: difficulty in converting language into robot controls ...

February 1, 2023 · 3 min · 526 words · Sukai Huang

Kyle_mahowald Dissociating Language and Thought in Large Language Models a Cognitive Perspective 2023

Title: Dissociating Language and Thought in Large Language Models: A Cognitive Perspective Author: Kyle Mahowald et al. Publish Year: 16 Jan 2023 Review Date: Tue, Jan 31, 2023 url: https://arxiv.org/pdf/2301.06627.pdf Summary of paper Motivation: the authors challenge the “good at language $\implies$ good at thought” fallacy. The second fallacy is “bad at thought $\implies$ bad at language”. Contribution: the authors argue that LLMs have promise as scientific models of one piece of the human cognitive toolbox (formal language processing) but fall short of modelling human thought. In section 4, we consider several domains required for functional linguistic competence: formal reasoning, world knowledge, situation modelling and social cognitive abilities. Some key terms: deep learning models in linguistics ...

January 31, 2023 · 4 min · 776 words · Sukai Huang

Michael_janner Planning With Diffusion for Flexible Behaviour Synthesis 2022

Title: Planning With Diffusion for Flexible Behaviour Synthesis Author: Michael Janner et al. Publish Year: 21 Dec 2022 Review Date: Mon, Jan 30, 2023 Summary of paper Motivation: use the diffusion model to learn the dynamics; tight coupling of modelling and planning. Our goal is to break this abstraction barrier by designing a model and planning algorithm that are trained alongside one another, resulting in a non-autoregressive trajectory-level model for which sampling and planning are nearly identical. Some key terms: ideal model-based RL ...

January 30, 2023 · 2 min · 317 words · Sukai Huang

Shailaja_keyur_sampat Reasoning About Actions Over Visual and Linguistic Modalities a Survey 2022

Title: Reasoning About Actions Over Visual and Linguistic Modalities: A Survey 2022 Author: Publish Year: Review Date: Fri, Jan 20, 2023 Summary of paper Motivation: reasoning about actions and changes has been widely studied in the knowledge representation community, and it has recently piqued the interest of NLP and computer vision researchers. Contribution: Some key terms: six most frequent types of commonsense knowledge tasks that involve language-based reasoning about actions ...

January 20, 2023 · 3 min · 524 words · Sukai Huang

Xin_wang Reinforced Cross Modal Matching and Self Supervised Imitation Learning for Vision Language Navigation 2019

Title: Reinforced Cross Modal Matching and Self Supervised Imitation Learning for Vision Language Navigation 2019 Author: Xin Wang et al. Publish Year: Review Date: Wed, Jan 18, 2023 Summary of paper Motivation: Vision-Language Navigation (VLN) presents some unique challenges. First, reasoning over images and natural language instructions can be difficult. Second, except when strictly following expert demonstrations, the feedback is rather coarse, since the “Success” feedback is provided only when the agent reaches a target position (sparse reward). A good “instruction following” trajectory may end up stopping just before reaching the goal state and then receive zero reward. Third, existing work suffers from a generalisation problem (the agent needs to be retrained in a new environment). Implementation: the agent can infer which sub-instruction to focus on and where to look (automatic splitting of long instructions), with a matching critic that evaluates an executed path by the probability of reconstructing the original instruction from the executed path, P(original instruction | past trajectory). Cycle reconstruction: we have P(target trajectory | the instruction) = 1, and we want to measure P(original instruction | past trajectory). This will enhance interpretability, as you can now understand what the robot was thinking.

January 18, 2023 · 1 min · 195 words · Sukai Huang
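
For the Reinforced Cross Modal Matching entry above, the matching critic's score P(original instruction | past trajectory) can be sketched as the product of per-token probabilities assigned by a model that reconstructs the instruction from the path. The per-token log-probabilities, the mixing weight `delta`, and the additive combination with the sparse success reward are illustrative assumptions; no real reconstruction model is involved.

```python
import math


def matching_critic_reward(instruction_token_log_probs):
    """P(instruction | trajectory) as the product of per-token probabilities."""
    return math.exp(sum(instruction_token_log_probs))


def total_reward(extrinsic, intrinsic, delta=0.5):
    """Sparse environment reward plus a weighted matching-critic reward."""
    return extrinsic + delta * intrinsic


# Hypothetical log-probabilities of a two-token instruction given two paths.
on_path = [math.log(0.9), math.log(0.8)]   # path that follows the instruction
off_path = [math.log(0.3), math.log(0.2)]  # path that wanders elsewhere

# Both paths stop short of the goal, so the extrinsic reward is zero,
# yet the matching critic still separates them.
reward_on = total_reward(0.0, matching_critic_reward(on_path))
reward_off = total_reward(0.0, matching_critic_reward(off_path))
```

This is how a trajectory that follows the instruction but stops just short of the goal can still receive a learning signal despite the sparse success reward.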

Alekh_agarwal PC-PG Policy Cover Directed Exploration for Provable Policy Gradient Learning 2020

Title: PC-PG Policy Cover Directed Exploration for Provable Policy Gradient Learning Author: Alekh Agarwal et al. Publish Year: Review Date: Wed, Dec 28, 2022 Summary of paper Motivation: the primary drawback of direct policy gradient methods is that, by being local in nature, they fail to adequately explore the environment. In contrast, model-based approaches and Q-learning directly handle exploration through the use of optimism. Contribution: the Policy Cover-Policy Gradient algorithm (PC-PG), a direct, model-free policy optimisation approach which addresses exploration through the use of a learned ensemble of policies; the latter provides a policy cover over the state space. The use of a learned policy cover addresses exploration, and also addresses the catastrophic forgetting problem in policy gradient approaches (which use reward bonuses); the on-policy algorithm, where approximation errors due to model misspecification amplify (see [Lu et al., 2018] for discussion). Some key terms: suffering from sparse reward ...

December 28, 2022 · 2 min · 271 words · Sukai Huang

Alekh_agarwal on the Theory of Policy Gradient Methods Optimality Approximation and Distribution Shift 2020

Title: On the Theory of Policy Gradient Methods Optimality Approximation and Distribution Shift 2020 Author: Alekh Agarwal et al. Publish Year: 14 Oct 2020 Review Date: Wed, Dec 28, 2022 Summary of paper Motivation: little is known about even their most basic theoretical convergence properties, including if and how fast they converge to a globally optimal solution and how they cope with approximation error due to using a restricted class of parametric policies. Contribution: one central contribution of this work is in providing approximation guarantees that are average case (avoiding explicit worst-case dependencies on the size of the state space), obtained by making a formal connection to supervised learning under distribution shift. This characterisation shows an important interplay between estimation error, approximation error and exploration (as characterised through a precisely defined condition number). Some key terms: basic theoretical convergence questions ...

December 28, 2022 · 3 min · 557 words · Sukai Huang

Chloe_ching_yun_hsu Revisiting Design Choices in Proximal Policy Optimisation 2020

Title: Revisiting Design Choices in Proximal Policy Optimisation Author: Chloe Ching-Yun Hsu et al. Publish Year: 23 Sep 2020 Review Date: Wed, Dec 28, 2022 Summary of paper Motivation: Contribution: on discrete action spaces with sparse high rewards, standard PPO often gets stuck at suboptimal actions. Why: the paper analyzes the reasons for these failure modes and explains why they are not exposed by standard benchmarks. In summary, our study suggests that Beta policy parameterization and KL-regularized objectives should be reconsidered for PPO, especially when alternatives improve PPO in all settings. The authors prove a convergence guarantee for the PPO-KL penalty version, as it inherits the convergence guarantees of mirror descent for policy families that are closed under mixture. Some key terms: design choices ...

December 28, 2022 · 3 min · 467 words · Sukai Huang
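
For the Revisiting Design Choices in Proximal Policy Optimisation entry above, the KL-penalty variant of PPO can be sketched for a single state with a discrete action space. The objective is the importance-weighted advantage minus `beta` times the KL divergence from the old policy to the new one; the probabilities, advantages and `beta` below are made-up values, and exact expectations replace sampled estimates for simplicity.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same actions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def ppo_kl_objective(old_probs, new_probs, advantages, beta):
    """KL-penalised surrogate objective (to be maximised) for one state.

    Assumes every action has non-zero probability under the old policy,
    so that the ratio new / old is well defined.
    """
    surrogate = sum(
        old * (new / old) * adv
        for old, new, adv in zip(old_probs, new_probs, advantages)
    )
    return surrogate - beta * kl_divergence(old_probs, new_probs)


old_policy = [0.5, 0.5]
new_policy = [0.75, 0.25]   # shifts probability towards the better action
advantages = [1.0, -1.0]

unchanged = ppo_kl_objective(old_policy, old_policy, advantages, beta=1.0)
unpenalised = ppo_kl_objective(old_policy, new_policy, advantages, beta=0.0)
penalised = ppo_kl_objective(old_policy, new_policy, advantages, beta=1.0)
```

The penalty shrinks the gain from moving probability mass, so larger values of `beta` keep each update closer to the old policy.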