Damai Dai Deepseekmoe 2024

[TOC] Title: DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture of Experts Language Models Author: Damai Dai et al. Publish Year: 11 Jan 2024 Review Date: Sat, Jun 22, 2024 url: https://arxiv.org/pdf/2401.06066 Summary of paper Motivation conventional MoE architectures like GShard, which activate the top-K out of N experts, face challenges in ensuring expert specialization, i.e., each expert acquiring non-overlapping and focused knowledge; in response, we propose the DeepSeekMoE architecture towards ultimate expert specialization Contribution segmenting each expert into m smaller ones (mN in total) and activating mK of them isolating K_s experts as shared ones, aiming at capturing common knowledge and mitigating redundancy in routed experts Some key terms MoE architecture...
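
A minimal sketch of the two ideas above (fine-grained routed experts plus always-active shared experts), assuming standard PyTorch; the sizes, names (`n_shared`, `n_routed`, `top_k`) and the dense expert evaluation are illustrative, not the paper's implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeepSeekMoESketch(nn.Module):
    """Fine-grained routed experts + shared experts (illustrative only)."""
    def __init__(self, d_model=64, d_ff=128, n_shared=2, n_routed=16, top_k=4):
        super().__init__()
        make_expert = lambda: nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.shared = nn.ModuleList(make_expert() for _ in range(n_shared))  # always active
        self.routed = nn.ModuleList(make_expert() for _ in range(n_routed))  # mN small experts
        self.gate = nn.Linear(d_model, n_routed, bias=False)
        self.top_k = top_k                                                   # activate mK of them

    def forward(self, x):                                    # x: (batch, d_model)
        shared_out = sum(e(x) for e in self.shared)
        scores = F.softmax(self.gate(x), dim=-1)             # (batch, n_routed)
        topv, topi = scores.topk(self.top_k, dim=-1)
        mask = torch.zeros_like(scores).scatter(-1, topi, topv)
        # dense evaluation of every routed expert for clarity (a real MoE dispatches sparsely)
        routed_out = torch.stack([e(x) for e in self.routed], dim=1)  # (batch, n_routed, d_model)
        return shared_out + (mask.unsqueeze(-1) * routed_out).sum(dim=1)

out = DeepSeekMoESketch()(torch.randn(8, 64))                # (8, 64)
```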

<span title='2024-06-22 11:13:50 +1000 AEST'>June 22, 2024</span>&nbsp;·&nbsp;3 min&nbsp;·&nbsp;582 words&nbsp;·&nbsp;Sukai Huang

Jessy Lin Learning to Model the World With Language 2024

[TOC] Title: Learning to Model the World With Language 2024 Author: Jessy Lin et al. Publish Year: ICML 2024 Review Date: Fri, Jun 21, 2024 url: https://arxiv.org/abs/2308.01399 Summary of paper Motivation in this work, we propose that agents can ground diverse kinds of language by using it to predict the future in contrast to directly predicting what to do with a language-conditioned policy, Dynalang decouples learning to model the world with language (supervised learning with prediction objectives) from learning to act given that model (RL with task rewards) Future prediction provides a rich grounding signal for learning what language utterances mean, which in turn equips the agent with a richer understanding of the world to solve complex tasks....

<span title='2024-06-21 11:47:25 +1000 AEST'>June 21, 2024</span>&nbsp;·&nbsp;2 min&nbsp;·&nbsp;381 words&nbsp;·&nbsp;Sukai Huang

Verification in Llm Topic 2024

[TOC] Review Date: Thu, Jun 20, 2024 Verification in LLM Topic 2024 Paper 1: Weng, Yixuan, et al. “Large language models are better reasoners with self-verification.” arXiv preprint arXiv:2212.09561 (2022). the better reasoning with CoT is carried out in the following two steps: Forward Reasoning and Backward Verification. Specifically, in Forward Reasoning, LLM reasoners generate candidate answers using CoT, and the question and candidate answers form different conclusions to be verified....
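
A rough sketch of that two-step loop; `generate` stands for any prompt-to-completion callable, and the verification step here is simplified to a yes/no re-check rather than the paper's condition-masking procedure:

```python
def self_verify(question, generate, n_candidates=5, n_checks=3):
    # Forward Reasoning: sample several chain-of-thought candidate answers.
    candidates = [
        generate(f"Q: {question}\nLet's think step by step, then give a final answer.")
        for _ in range(n_candidates)
    ]
    # Backward Verification: take each candidate as a conclusion and ask the model
    # to re-check it against the question; keep the candidate that passes most checks.
    def score(answer):
        verdicts = [
            generate(f"Question: {question}\nProposed answer: {answer}\n"
                     "Does this answer satisfy every condition in the question? Answer yes or no.")
            for _ in range(n_checks)
        ]
        return sum(v.strip().lower().startswith("yes") for v in verdicts)
    return max(candidates, key=score)
```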

<span title='2024-06-20 20:19:12 +1000 AEST'>June 20, 2024</span>&nbsp;·&nbsp;1 min&nbsp;·&nbsp;110 words&nbsp;·&nbsp;Sukai Huang

Jiuzhou Reward Engineering for Generating Semi Structured Explan 2023

[TOC] Title: Reward Engineering for Generating Semi-Structured Explanation Author: Jiuzhou Han et al. Publish Year: EACL 2024 Review Date: Thu, Jun 20, 2024 url: https://github.com/Jiuzhouh/Reward-Engineering-for-Generating-SEG Summary of paper Motivation Contribution the objective is to equip moderately-sized LMs with the ability to not only provide answers but also generate structured explanations Some key terms Intro the authors give some background: Cui et al. incorporate a generative pre-training mechanism over synthetic graphs by aligning input text-graph pairs to improve the model’s capability in generating semi-structured explanations....

<span title='2024-06-20 14:11:32 +1000 AEST'>June 20, 2024</span>&nbsp;·&nbsp;1 min&nbsp;·&nbsp;162 words&nbsp;·&nbsp;Sukai Huang

Jiuzhou Towards Uncertainty Aware Lang Agent 2024

[TOC] Title: Towards Uncertainty Aware Language Agent Author: Jiuzhou Han et al. Publish Year: 30 May 2024 Review Date: Thu, Jun 20, 2024 url: arXiv:2401.14016v3 Summary of paper Motivation The existing approaches neglect the notion of uncertainty during these interactions Contribution Some key terms Related work 1: lang agent the authors define what a language agent is and discuss it – the prominent work ReAct proposes a general language agent framework that combines reasoning and acting with LLMs for solving diverse language reasoning tasks....

<span title='2024-06-20 11:15:18 +1000 AEST'>June 20, 2024</span>&nbsp;·&nbsp;2 min&nbsp;·&nbsp;295 words&nbsp;·&nbsp;Sukai Huang

Silviu Pitis Failure Modes of Learning Reward Models for Sequence Model 2023

[TOC] Title: Failure Modes of Learning Reward Models for LLMs and other Sequence Models Author: Silviu Pitis Publish Year: ICML workshop 2023 Review Date: Fri, May 10, 2024 url: https://openreview.net/forum?id=NjOoxFRZA4&noteId=niZsZfTPPt Summary of paper C3. Preference cannot be represented as numbers M1. rationality level of human preference 3.2, if the condition/context changes, the preference may change rapidly, and this cannot be reflected in the reward model A2. Preference should be expressed with respect to state-policy pairs, rather than just outcomes A state-policy pair includes both the current state of the system and the strategy (policy) being employed....

<span title='2024-05-10 22:23:31 +1000 AEST'>May 10, 2024</span>&nbsp;·&nbsp;2 min&nbsp;·&nbsp;312 words&nbsp;·&nbsp;Sukai Huang

Gaurav Ghosal the Effect of Modeling Human Rationality Level 2023

[TOC] Title: The Effect of Modeling Human Rationality Level on Learning Rewards from Multiple Feedback Types Author: Gaurav R. Ghosal et al. Publish Year: 9 Mar 2023 AAAI 2023 Review Date: Fri, May 10, 2024 url: arXiv:2208.10687v2 Summary of paper Contribution We find that overestimating human rationality can have dire effects on reward learning accuracy and regret We also find that fitting the rationality coefficient to human data enables better reward learning, even when the human deviates significantly from the noisy-rational choice model due to systematic biases Some key terms What is the Boltzmann rationality coefficient $\beta$...
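
As a quick reminder of what $\beta$ controls in the noisy-rational (Boltzmann) choice model: $\beta = 0$ gives uniformly random choices, and $\beta \to \infty$ gives a perfectly reward-maximizing chooser. A small numeric sketch (the reward values are just for illustration):

```python
import numpy as np

def boltzmann_choice_probs(rewards, beta):
    """P(choice i) is proportional to exp(beta * r_i)."""
    logits = beta * np.asarray(rewards, dtype=float)
    logits -= logits.max()                 # numerical stability
    p = np.exp(logits)
    return p / p.sum()

rewards = [1.0, 2.0, 3.0]
print(boltzmann_choice_probs(rewards, beta=0.0))   # ~uniform: fully noisy feedback
print(boltzmann_choice_probs(rewards, beta=5.0))   # nearly deterministic: highly rational
```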

<span title='2024-05-10 19:35:03 +1000 AEST'>May 10, 2024</span>&nbsp;·&nbsp;2 min&nbsp;·&nbsp;312 words&nbsp;·&nbsp;Sukai Huang

Nate Rahn Policy Optimization in Noisy Neighbourhood 2023

[TOC] Title: Policy Optimization in Noisy Neighborhood Author: Nate Rahn et al. Publish Year: NeurIPS 2023 Review Date: Fri, May 10, 2024 url: https://arxiv.org/abs/2309.14597 Summary of paper Contribution in this paper, we demonstrate that high-frequency discontinuities in the mapping from policy parameters $\theta$ to return $R(\theta)$ are an important cause of return variation. As a consequence of these discontinuities, a single gradient step or perturbation to the policy parameters often causes significant changes in the return, even in settings where both the policy and the dynamics are deterministic....
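
A tiny sketch of how one might probe such a "noisy neighborhood": evaluate the return at small random perturbations of $\theta$ and look at the spread. `evaluate_return` is a placeholder for any deterministic params-to-return rollout (here assumed to take a NumPy array), not the paper's code:

```python
import numpy as np

def probe_neighborhood(theta, evaluate_return, sigma=0.01, n_samples=50, seed=0):
    rng = np.random.default_rng(seed)
    base = evaluate_return(theta)
    perturbed = [
        evaluate_return(theta + sigma * rng.standard_normal(theta.shape))
        for _ in range(n_samples)
    ]
    # a large spread despite tiny, deterministic parameter changes indicates
    # high-frequency discontinuities in the return landscape around theta
    return base, float(np.mean(perturbed)), float(np.std(perturbed))
```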

<span title='2024-05-10 14:16:56 +1000 AEST'>May 10, 2024</span>&nbsp;·&nbsp;3 min&nbsp;·&nbsp;510 words&nbsp;·&nbsp;Sukai Huang

Ademi Adeniji Language Reward Modulation for Pretraining Rl 2023

[TOC] Title: Language Reward Modulation for Pretraining Reinforcement Learning Author: Ademi Adeniji et al. Publish Year: ICLR 2023 reject Review Date: Thu, May 9, 2024 url: https://openreview.net/forum?id=SWRFC2EupO Summary of paper Motivation Learned reward functions (LRFs) are notorious for noise and reward misspecification errors, which can render them highly unreliable for learning robust policies with RL; due to reward exploitation and model noise, these LRFs are ill-suited for directly learning downstream tasks....

<span title='2024-05-09 21:18:00 +1000 AEST'>May 9, 2024</span>&nbsp;·&nbsp;2 min&nbsp;·&nbsp;338 words&nbsp;·&nbsp;Sukai Huang

Thomas Coste Reward Model Ensembles Help Mitigate Overoptimization 2024

[TOC] Title: Reward Model Ensembles Help Mitigate Overoptimization Author: Thomas Coste et al. Publish Year: 10 Mar 2024 Review Date: Thu, May 9, 2024 url: arXiv:2310.02743v2 Summary of paper Motivation however, as imperfect representations of the “true” reward, these learned reward models are susceptible to over-optimization. Contribution the authors conducted a systematic study to evaluate the efficacy of ensemble-based conservative optimization objectives, specifically worst-case optimization (WCO) and uncertainty-weighted optimization (UWO), for mitigating reward model overoptimization the authors additionally extend the setup to include 25% label noise to better mirror real-world conditions For PPO, ensemble-based conservative optimization always reduces overoptimization and outperforms single reward model optimization Some key terms Overoptimization...
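
A minimal sketch of the two conservative objectives as I understand them, computed over a batch of responses scored by an ensemble of reward models; the variance-penalty coefficient is illustrative:

```python
import torch

def conservative_rewards(ensemble_scores, uwo_coeff=1.0):
    """ensemble_scores: (n_models, batch) rewards for the same responses."""
    wco = ensemble_scores.min(dim=0).values                                      # worst-case optimization
    uwo = ensemble_scores.mean(dim=0) - uwo_coeff * ensemble_scores.var(dim=0)   # uncertainty-weighted
    return wco, uwo

scores = torch.randn(4, 8)          # e.g. 4 reward models scoring 8 responses
wco, uwo = conservative_rewards(scores)
```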

<span title='2024-05-09 14:06:33 +1000 AEST'>May 9, 2024</span>&nbsp;·&nbsp;1 min&nbsp;·&nbsp;205 words&nbsp;·&nbsp;Sukai Huang

Mengdi Li Internally Rewarded Rl 2023

[TOC] Title: Internally Rewarded Reinforcement Learning Author: Mengdi Li et al. Publish Year: 2023 PMLR Review Date: Wed, May 8, 2024 url: https://proceedings.mlr.press/v202/li23ax.html Summary of paper Motivation the authors study a class of RL problems where the reward signals for policy learning are generated by a discriminator that is dependent on and jointly optimized with the policy (parallel training of both the policy and the reward model) this leads to an unstable learning process because reward signals from an immature discriminator are noisy and impede policy learning, and conversely, an under-optimized policy impedes discriminator learning we call this learning setting Internally Rewarded RL (IRRL) as the reward is not provided directly by the environment but internally by the discriminator....
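
A toy, self-contained illustration of that coupled loop (my own stand-in, not the paper's setup or tasks): the discriminator tries to infer which goal the policy was pursuing from the resulting observation, its log-likelihood serves as the internal reward, and the two are updated in alternation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

n_goals, n_actions, obs_dim = 4, 6, 6
policy = nn.Linear(n_goals, n_actions)          # goal -> action logits
discriminator = nn.Linear(obs_dim, n_goals)     # observation -> inferred goal
opt_pi = torch.optim.Adam(policy.parameters(), lr=1e-2)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-2)

def env_step(actions):
    # stand-in environment: the observation is a noisy one-hot of the action taken
    return F.one_hot(actions, obs_dim).float() + 0.1 * torch.randn(len(actions), obs_dim)

for step in range(500):
    goals = torch.randint(0, n_goals, (64,))
    dist = torch.distributions.Categorical(logits=policy(F.one_hot(goals, n_goals).float()))
    actions = dist.sample()
    obs = env_step(actions)

    # discriminator update: supervised on data generated by the current policy
    d_loss = F.cross_entropy(discriminator(obs), goals)
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # policy update: REINFORCE on the internal reward (noisy while the discriminator is immature)
    with torch.no_grad():
        reward = F.log_softmax(discriminator(obs), dim=-1).gather(1, goals[:, None]).squeeze(1)
    pi_loss = -(reward * dist.log_prob(actions)).mean()
    opt_pi.zero_grad(); pi_loss.backward(); opt_pi.step()
```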

<span title='2024-05-08 14:59:15 +1000 AEST'>May 8, 2024</span>&nbsp;·&nbsp;4 min&nbsp;·&nbsp;682 words&nbsp;·&nbsp;Sukai Huang

Xuran Pan on the Integration of Self Attention and Convolution 2022

[TOC] Title: On the Integration of Self-Attention and Convolution Author: Xuran Pan et al. Publish Year: 2022 IEEE Review Date: Thu, Apr 25, 2024 url: https://arxiv.org/abs/2111.14556 Summary of paper Motivation there exists a strong underlying relation between convolution and self-attention. Related work Convolutional NNs use convolution kernels to extract local features and have become the most powerful and conventional technique for various vision tasks Self-attention only Recently, vision transformers showed that, given enough data, we can treat an image as a sequence of 256 tokens and leverage Transformer models to achieve competitive results in image recognition....

<span title='2024-04-25 17:53:46 +1000 AEST'>April 25, 2024</span>&nbsp;·&nbsp;1 min&nbsp;·&nbsp;147 words&nbsp;·&nbsp;Sukai Huang

Recent Language Model Technique 2024

[TOC] Title: Recent Language Model Technique 2024 Review Date: Thu, Apr 25, 2024 url: https://www.youtube.com/watch?v=kzB23CoZG30 url2: https://www.youtube.com/watch?v=iH-wmtxHunk url3: https://www.youtube.com/watch?v=o68RRGxAtDo Llama 3 key modification: grouped query attention (GQA) key instruction-tuning process: Their approach to post-training is a combination of supervised fine-tuning (SFT), rejection sampling, proximal policy optimization (PPO), and direct preference optimization (DPO). The quality of the prompts used in SFT and of the preference rankings used in PPO and DPO has an outsized influence on the performance of aligned models....
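
For reference, a minimal sketch of grouped-query attention: several query heads share each key/value head, which shrinks the KV cache. Shapes and names are illustrative, not Llama 3's implementation:

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(x, wq, wk, wv, n_heads, n_kv_heads):
    """x: (B, T, d_model); wq: (d_model, n_heads*d_head); wk, wv: (d_model, n_kv_heads*d_head)."""
    B, T, _ = x.shape
    d_head = wq.shape[1] // n_heads
    q = (x @ wq).view(B, T, n_heads, d_head).transpose(1, 2)      # (B, n_heads, T, d_head)
    k = (x @ wk).view(B, T, n_kv_heads, d_head).transpose(1, 2)   # (B, n_kv_heads, T, d_head)
    v = (x @ wv).view(B, T, n_kv_heads, d_head).transpose(1, 2)
    group = n_heads // n_kv_heads
    k = k.repeat_interleave(group, dim=1)   # each group of query heads reuses one K/V head
    v = v.repeat_interleave(group, dim=1)
    att = (q @ k.transpose(-2, -1)) / d_head ** 0.5
    causal = torch.triu(torch.ones(T, T), diagonal=1).bool()
    att = att.masked_fill(causal, float("-inf"))
    return (F.softmax(att, dim=-1) @ v).transpose(1, 2).reshape(B, T, -1)

x = torch.randn(2, 10, 64)
out = grouped_query_attention(x, torch.randn(64, 64), torch.randn(64, 16), torch.randn(64, 16),
                              n_heads=8, n_kv_heads=2)   # 8 query heads share 2 KV heads
```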

<span title='2024-04-25 12:49:03 +1000 AEST'>April 25, 2024</span>&nbsp;·&nbsp;2 min&nbsp;·&nbsp;332 words&nbsp;·&nbsp;Sukai Huang

Thomas Carta Grounding Llms in Rl 2023

[TOC] Title: Grounding Large Language Models in Interactive Environments with Online Reinforcement Learning Author: Thomas Carta et al. Publish Year: 6 Sep 2023 Review Date: Tue, Apr 23, 2024 url: arXiv:2302.02662v3 Summary of paper Summary The author considered an agent using an LLM as a policy that is progressively updated as the agent interacts with the environment, leveraging online reinforcement learning to improve its performance at solving goals (under the RL paradigm environment (MDP))....
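
A small sketch of the "LLM as policy" idea: score each admissible action by the log-probability the LLM assigns to its tokens given the goal/observation prompt, then normalize over actions (the online PPO update is omitted). The model name and prompt format are placeholders; only standard transformers calls are used:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")

def action_distribution(prompt, actions):
    scores = []
    for a in actions:
        ids = tok(prompt + " " + a, return_tensors="pt").input_ids
        n_prompt = tok(prompt, return_tensors="pt").input_ids.shape[1]
        with torch.no_grad():
            logp = F.log_softmax(lm(ids).logits[0, :-1], dim=-1)
        targets = ids[0, 1:]
        token_logps = logp[torch.arange(len(targets)), targets]
        scores.append(token_logps[n_prompt - 1:].sum())      # sum over the action's tokens only
    return torch.softmax(torch.stack(scores), dim=0)         # policy over admissible actions

probs = action_distribution(
    "Goal: pick up the red ball. Observation: a red ball is on the table. Next action:",
    ["pick up ball", "go forward", "turn left"])
```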

<span title='2024-04-23 13:20:22 +1000 AEST'>April 23, 2024</span>&nbsp;·&nbsp;2 min&nbsp;·&nbsp;242 words&nbsp;·&nbsp;Sukai Huang

Daniel Hierarchies of Reward Machines 2023

[TOC] Title: Hierarchies of Reward Machines Author: Daniel Furelos-Blanco et al. Publish Year: 4 Jun 2023 Review Date: Fri, Apr 12, 2024 url: https://arxiv.org/abs/2205.15752 Summary of paper Motivation Finite state machines are a simple yet powerful formalism for abstractly representing temporal tasks in a structured manner. Contribution The work introduces Hierarchies of Reward Machines (HRMs) to enhance the abstraction power of existing models. Key contributions include: HRM Abstraction Power: HRMs allow for the creation of hierarchies of reward machines (RMs), enabling constituent RMs to call other RMs....
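
A minimal sketch of a flat reward machine to make the formalism concrete; in an HRM, a transition could instead call another machine and resume once that sub-machine reaches its accepting node. The task and propositions below are made up:

```python
class RewardMachine:
    def __init__(self, transitions, initial="u0", accepting="u_acc"):
        # transitions: {(node, proposition): (next_node, reward)};
        # unmatched labels self-loop with zero reward
        self.transitions = transitions
        self.initial, self.accepting = initial, accepting

    def step(self, node, proposition):
        return self.transitions.get((node, proposition), (node, 0.0))

# toy "get coffee, then deliver it to the office" task over propositions from a labeling function
rm = RewardMachine({
    ("u0", "at_coffee"): ("u1", 0.0),
    ("u1", "at_office"): ("u_acc", 1.0),
})
node, total = rm.initial, 0.0
for prop in ["none", "at_coffee", "none", "at_office"]:
    node, r = rm.step(node, prop)
    total += r
print(node, total)   # u_acc 1.0
```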

<span title='2024-04-12 15:12:54 +1000 AEST'>April 12, 2024</span>&nbsp;·&nbsp;5 min&nbsp;·&nbsp;965 words&nbsp;·&nbsp;Sukai Huang

Shanchuan Efficient N Robust Exploration Through Discriminative Ir 2023

[TOC] Title: DEIR: Efficient and Robust Exploration through Discriminative-Model-Based Episodic Intrinsic Rewards Author: Shanchuan Wan et al. Publish Year: 18 May 2023 Review Date: Fri, Apr 12, 2024 url: https://arxiv.org/abs/2304.10770 Summary of paper Motivation Recent studies have shown the effectiveness of encouraging exploration with intrinsic rewards estimated from novelties in observations However, there is a gap between the novelty of an observation and actual exploration, as both the stochasticity in the environment and the agent’s behaviour may affect the observation....

<span title='2024-04-12 15:07:58 +1000 AEST'>April 12, 2024</span>&nbsp;·&nbsp;9 min&nbsp;·&nbsp;1795 words&nbsp;·&nbsp;Sukai Huang

Discover Hierarchical Achieve in Rl via Cl 2023

[TOC] Title: Discovering Hierarchical Achievements in Reinforcement Learning via Contrastive Learning Author: Seungyong Moon et al. Publish Year: 2 Nov 2023 Review Date: Tue, Apr 2, 2024 url: https://arxiv.org/abs/2307.03486 Summary of paper Contribution PPO agents demonstrate some ability to predict future achievements. Leveraging this observation, a novel contrastive learning method called achievement distillation is introduced, enhancing the agent’s predictive abilities. This approach excels at discovering hierarchical achievements. Some key terms Model-based and explicit-module approaches in previous studies are not that good...

<span title='2024-04-02 21:02:37 +1100 AEDT'>April 2, 2024</span>&nbsp;·&nbsp;5 min&nbsp;·&nbsp;1047 words&nbsp;·&nbsp;Sukai Huang

Jia Li Structured Cot Prompting for Code Generation 2023

[TOC] Title: Structured Chain of Thought Prompting for Code Generation 2023 Author: Jia Li et al. Publish Year: 7 Sep 2023 Review Date: Wed, Feb 28, 2024 url: https://arxiv.org/pdf/2305.06599.pdf Summary of paper Contribution The paper introduces Structured CoTs (SCoTs) and a novel prompting technique called SCoT prompting for improving code generation with Large Language Models (LLMs) like ChatGPT and Codex. Unlike the previous Chain-of-Thought (CoT) prompting, which focuses on natural language reasoning steps, SCoT prompting leverages the structural information inherent in source code....
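
An illustrative SCoT-style prompt (my own simplification, not the paper's exact template): the model is asked to draft the solution with program structures (sequence, branch, loop) before writing the code itself:

```python
scot_prompt = """\
### Task
Write a Python function that returns the second-largest element of a list.

### Structured chain of thought (use sequence / branch / loop structures)
Input: a list nums.
Branch: if len(nums) < 2, return None.
Sequence: track two variables, largest and second.
Loop: for each x in nums, update largest and second.
Output: return second.

### Code
"""
```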

<span title='2024-02-28 19:59:38 +1100 AEDT'>February 28, 2024</span>&nbsp;·&nbsp;2 min&nbsp;·&nbsp;381 words&nbsp;·&nbsp;Sukai Huang

Stephanie Teaching Models to Express Their Uncertainty in Words 2022

[TOC] Title: Teaching Models to Express Their Uncertainty in Words Author: Stephanie Lin et al. Publish Year: 13 Jun 2022 Review Date: Wed, Feb 28, 2024 url: https://arxiv.org/pdf/2205.14334.pdf Summary of paper Motivation The study demonstrates that a GPT-3 model can articulate uncertainty about its answers in natural language without relying on model logits. It generates both an answer and a confidence level (e.g., “90% confidence” or “high confidence”), which map to well-calibrated probabilities....

<span title='2024-02-28 16:12:53 +1100 AEDT'>February 28, 2024</span>&nbsp;·&nbsp;2 min&nbsp;·&nbsp;327 words&nbsp;·&nbsp;Sukai Huang

Gwenyth Estimating Confidence of Llm by Prompt Agreement 2023

[TOC] Title: Strength in Numbers: Estimating Confidence of Large Language Models by Prompt Agreement Author: Gwenyth Portillo Wightman et al. Publish Year: TrustNLP 2023 Review Date: Tue, Feb 27, 2024 url: https://aclanthology.org/2023.trustnlp-1.28.pdf Summary of paper Motivation while traditional classifiers produce scores for each label, language models instead produce scores for the generation, which may not be well calibrated. the authors proposed a method that involves comparing generated outputs across diverse prompts to create a confidence score....
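
A small sketch of the idea: query the model through several paraphrased prompts and use the agreement rate among the answers as the confidence score. `generate` is a placeholder for any prompt-to-completion callable, and the paraphrases and normalization are illustrative:

```python
from collections import Counter

def agreement_confidence(question, paraphrases, generate):
    answers = [generate(p.format(question=question)).strip().lower() for p in paraphrases]
    best, votes = Counter(answers).most_common(1)[0]
    return best, votes / len(answers)   # majority answer, agreement rate as confidence

paraphrases = [
    "Answer concisely: {question}",
    "Question: {question}\nAnswer:",
    "{question} Respond with just the answer.",
]
# answer, confidence = agreement_confidence("What is the boiling point of water in Celsius?", paraphrases, generate)
```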

<span title='2024-02-27 15:44:06 +1100 AEDT'>February 27, 2024</span>&nbsp;·&nbsp;2 min&nbsp;·&nbsp;393 words&nbsp;·&nbsp;Sukai Huang

Sudhir Agarwal Translate Infer Compile for Accurate Text to Plan 2024

[TOC] Title: TIC: Translate-Infer-Compile for accurate “text to plan” using LLMs and logical intermediate representations Author: Sudhir Agarwal et al. Publish Year: Jan 2024 Review Date: Sat, Feb 17, 2024 url: https://arxiv.org/pdf/2402.06608.pdf Summary of paper Motivation using an LLM to generate the task PDDL from natural language planning task descriptions is challenging. One of the primary reasons for failure is that the LLM often makes errors when generating information that must abide by the constraints specified in the domain knowledge or the task descriptions...

<span title='2024-02-17 12:56:25 +1100 AEDT'>February 17, 2024</span>&nbsp;·&nbsp;3 min&nbsp;·&nbsp;639 words&nbsp;·&nbsp;Sukai Huang

Philip Cohen Intention Is Choice With Commitment 1990

[TOC] Title: Intention Is Choice With Commitment Author: Philip Cohen et al. Publish Year: 1990 Review Date: Tue, Jan 30, 2024 url: https://www.sciencedirect.com/science/article/pii/0004370290900555 Summary of paper Contribution This paper delves into the principles governing the rational balance between an agent’s beliefs, goals, actions, and intentions, offering valuable insights for both artificial agents and a theory of human action. It focuses on clarifying when an agent can abandon their goals and how strongly they are committed to these goals....

<span title='2024-01-30 23:17:51 +1100 AEDT'>January 30, 2024</span>&nbsp;·&nbsp;4 min&nbsp;·&nbsp;752 words&nbsp;·&nbsp;Sukai Huang

Christian Muise Planning for Goal Oriented Dialogue Systems 2019

[TOC] Title: Christian Muise Planning for Goal Oriented Dialogue Systems 2019 Author: Publish Year: Review Date: Tue, Jan 30, 2024 url: arXiv:1910.08137v1 Summary of paper Motivation there is increasing demand for dialogue agents capable of handling specific tasks and interactions in a business context Contribution the authors propose a new approach that eliminates the need for manual specification of dialogue trees, a common practice in existing systems. they suggest using a declarative representation of the dialogue agent, which can be processed by advanced planning techniques (tree -> planning) The paper introduces a paradigm shift in specifying complex dialogue agents by recognizing that many aspects of these agents share similarities or identical underlying processes....

<span title='2024-01-30 16:58:06 +1100 AEDT'>January 30, 2024</span>&nbsp;·&nbsp;2 min&nbsp;·&nbsp;416 words&nbsp;·&nbsp;Sukai Huang

Vishal Pallagani Llm N Planning Survey 2024

[TOC] Title: “On the Prospects of Incorporating Large Language Models (LLMs) in Automated Planning and Scheduling (APS).” Author: Pallagani, Vishal, et al. Publish Year: arXiv preprint arXiv:2401.02500 (2024). Review Date: Mon, Jan 29, 2024 url: Summary of paper Contribution The paper provides a comprehensive review of 126 papers focusing on the integration of Large Language Models (LLMs) within Automated Planning and Scheduling, a growing area in Artificial Intelligence (AI). It identifies eight categories where LLMs are applied in addressing various aspects of planning problems:...

<span title='2024-01-29 23:02:47 +1100 AEDT'>January 29, 2024</span>&nbsp;·&nbsp;3 min&nbsp;·&nbsp;546 words&nbsp;·&nbsp;Sukai Huang

Ishika Singh Progprompt Program Generation for Robot Task Planning 2023

[TOC] Title: ProgPrompt: program generation for situated robot task planning using large language models Author: Ishika Singh et al. Publish Year: 28 August 2023 Review Date: Mon, Jan 29, 2024 url: https://progprompt.github.io/ Summary of paper Motivation Classical task planning: requires myriad domain knowledge; large search space, hard to scale; domain specific; requires concrete goal specification Planning with LLMs: LLM is not situated in the scene; plan steps may use unavailable actions and objects; text-to-robot action mapping may not be trivial; combinatorial admissible action space....
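
For flavor, an illustrative ProgPrompt-style prompt (simplified, not the paper's exact format): the prompt itself is Python-like code listing importable actions, available objects, and an example plan, so the LLM completes the next task as a program. The action and object names below are made up:

```python
progprompt = '''
from actions import walk, grab, putin, open, close
objects = ["apple", "fridge", "table", "plate"]

def put_apple_in_fridge():
    # 1: go to the table and pick up the apple
    walk("table")
    grab("apple")
    assert "apple" in holding
    # 2: open the fridge, put the apple inside, close it
    walk("fridge")
    open("fridge")
    putin("apple", "fridge")
    close("fridge")

def throw_away_plate():
'''
```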

<span title='2024-01-29 20:45:59 +1100 AEDT'>January 29, 2024</span>&nbsp;·&nbsp;1 min&nbsp;·&nbsp;101 words&nbsp;·&nbsp;Sukai Huang