Pipeline Architecture

Pallagani Plansformer Generating Plans 2023

Title: Pallagani Plansformer Generating Plans 2023 Author: Pallagani, Vishal et al. Publish Year: GenPlan 2023 Workshop Review Date: Tue, Dec 24, 2024 url: https://arxiv.org/pdf/2212.08681 @InProceedings{pallagani2023plansformer, author = {Pallagani, Vishal and Muppasani, Bharath and Murugesan, Keerthiram and Rossi, Francesca and Horesh, Lior and Srivastava, Biplav and Fabiano, Francesco and Loreggia, Andrea}, title = {Plansformer: Generating Symbolic Plans using Transformers}, booktitle = {Seventh Workshop on Generalization in Planning (GenPlan 2023)}, year = {2023}, month = {December}, address = {New Orleans, USA}, venue = {Room 238-239, New Orleans Ernest N. Morial Convention Center} } Pallagani, Vishal, et al. "Plansformer: Generating Symbolic Plans using Transformers." NeurIPS 2023 Workshop on Generalization in Planning. [!Note] ...

December 24, 2024 · 4 min · 701 words · Sukai Huang

Damai Dai Deepseekmoe 2024

Title: DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture of Experts Language Models Author: Damai Dai et al. Publish Year: 11 Jan 2024 Review Date: Sat, Jun 22, 2024 url: https://arxiv.org/pdf/2401.06066 Summary of paper Motivation Conventional MoE architectures like GShard, which activate the top-K out of N experts, face challenges in ensuring expert specialization, i.e., that each expert acquires non-overlapping and focused knowledge. In response, we propose the DeepSeekMoE architecture towards ultimate expert specialization. Contribution (1) finely segmenting the experts into mN ones and activating mK of them; (2) isolating K_s experts as shared ones, aiming at capturing common knowledge and mitigating redundancy in routed experts. Some key terms MoE architecture ...
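The two ideas in the Contribution above (fine-grained expert segmentation and always-active shared experts) can be sketched as a toy routing function for a single token. This is a minimal illustration, not the paper's implementation: the name `route_token`, the softmax gating, and the example scores are assumptions made for the sketch, and the bookkeeping of how the shared experts are carved out of the expert pool is omitted.

```python
import math


def route_token(scores, num_shared, top_k):
    """Toy DeepSeekMoE-style expert selection for a single token.

    scores     -- hypothetical router logits, one per routed expert
    num_shared -- K_s, the number of shared experts that are always active
    top_k      -- number of routed experts to activate

    Returns (shared_ids, routed), where routed maps expert id -> gate weight.
    """
    # Shared experts bypass the router: every token uses them.
    shared_ids = list(range(num_shared))

    # Softmax over the routed experts' logits (assumed gating function).
    max_s = max(scores)
    exps = [math.exp(s - max_s) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep only the top_k routed experts; all others receive no weight.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    routed = {i: probs[i] for i in ranked[:top_k]}
    return shared_ids, routed


# Hypothetical setup: N = 4 experts split with m = 2 gives 8 routed experts,
# of which 4 are activated, plus one shared expert.
shared, routed = route_token(
    scores=[0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.5, 1.0],
    num_shared=1,
    top_k=4,
)
```

Splitting each expert into smaller ones while activating proportionally more of them keeps the compute per token roughly constant but allows many more expert combinations, which is the intuition behind the specialization claim.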

June 22, 2024 · 3 min · 582 words · Sukai Huang

Jessy Lin Learning to Model the World With Language 2024

Title: Learning to Model the World With Language 2024 Author: Jessy Lin et al. Publish Year: ICML 2024 Review Date: Fri, Jun 21, 2024 url: https://arxiv.org/abs/2308.01399 Summary of paper Motivation In this work, we propose that agents can ground diverse kinds of language by using it to predict the future. In contrast to directly predicting what to do with a language-conditioned policy, Dynalang decouples learning to model the world with language (supervised learning with prediction objectives) from learning to act given that model (RL with task rewards). Future prediction provides a rich grounding signal for learning what language utterances mean, which in turn equips the agent with a richer understanding of the world to solve complex tasks. Contribution investigate whether learning language-conditioned world models enables agents to scale to more diverse language use, compared to language-conditioned policies. Some key terms related work ...

June 21, 2024 · 2 min · 381 words · Sukai Huang

Verification in Llm Topic 2024

Review Date: Thu, Jun 20, 2024 Verification in LLM Topic 2024 Paper 1: Weng, Yixuan, et al. “Large language models are better reasoners with self-verification.” arXiv preprint arXiv:2212.09561 (2022). The better reasoning with CoT is carried out in the following two steps, Forward Reasoning and Backward Verification. Specifically, in Forward Reasoning, LLM reasoners generate candidate answers using CoT, and the question and candidate answers form different conclusions to be verified. In Backward Verification, we mask the original condition and predict its result using another CoT. We rank candidate conclusions based on a verification score, which is calculated by assessing the consistency between the predicted and original condition values ...

June 20, 2024 · 1 min · 110 words · Sukai Huang

Jiuzhou Reward Engineering for Generating Semi Structured Explan 2023

Title: Reward Engineering for Generating Semi-Structured Explanation Author: Jiuzhou Han et al. Publish Year: EACL 2024 Review Date: Thu, Jun 20, 2024 url: https://github.com/Jiuzhouh/Reward-Engineering-for-Generating-SEG Summary of paper Motivation Contribution the objective is to equip moderately-sized LMs with the ability to not only provide answers but also generate structured explanations Some key terms Intro the authors give some background: Cui et al. incorporate a generative pre-training mechanism over synthetic graphs by aligning input text-graph pairs to improve the model’s capability in generating semi-structured explanations. ...

June 20, 2024 · 1 min · 162 words · Sukai Huang

Jiuzhou Towards Uncertainty Aware Lang Agent 2024

Title: Towards Uncertainty Aware Language Agent Author: Jiuzhou Han et al. Publish Year: 30 May 2024 Review Date: Thu, Jun 20, 2024 url: arXiv:2401.14016v3 Summary of paper Motivation The existing approaches neglect the notion of uncertainty during these interactions Contribution Some key terms Related work 1: lang agent the authors define what a language agent is and discuss it – the prominent work ReAct proposes a general language agent framework that combines reasoning and acting with LLMs for solving diverse language reasoning tasks. continuing the track of ReAct – Reflexion uses the history of failed trials as input to ask for reflection and can gain better results FireAct – adds more diverse fine-tuning data to improve performance Later the authors mention Toolformer, Gorilla and other language agents that do not start from ReAct ...

June 20, 2024 · 2 min · 295 words · Sukai Huang

Silviu Pitis Failure Modes of Learning Reward Models for Sequence Model 2023

Title: Failure Modes of Learning Reward Models for LLMs and other Sequence Models Author: Silviu Pitis Publish Year: ICML workshop 2023 Review Date: Fri, May 10, 2024 url: https://openreview.net/forum?id=NjOoxFRZA4&noteId=niZsZfTPPt Summary of paper C3. Preference cannot be represented as numbers M1. rationality level of human preference 3.2, if the condition/context changes, the preference may change rapidly, and this cannot be reflected in the reward machine A2. Preference should be expressed with respect to state-policy pairs, rather than just outcomes A state-policy pair includes both the current state of the system and the strategy (policy) being employed. This approach avoids the complication of unresolved stochasticity (randomness that hasn’t yet been resolved), focusing instead on scenarios where the outcomes of policies are already known. Example with Texas Hold’em: The author uses an example from poker to illustrate these concepts. In the example, a player holding a weaker hand (72o) wins against a stronger hand (AA) after both commit to large bets pre-flop. Traditional reward modeling would prefer the successful trajectory of the weaker hand due to the positive outcome. However, a rational analysis (ignoring stochastic outcomes) would prefer the decision-making associated with the stronger hand (AA), even though it lost, as it’s typically the better strategy. ...

May 10, 2024 · 2 min · 312 words · Sukai Huang

Gaurav Ghosal the Effect of Modeling Human Rationality Level 2023

Title: The Effect of Modeling Human Rationality Level on Learning Rewards from Multiple Feedback Types Author: Gaurav R. Ghosal et al. Publish Year: 9 Mar 2023 AAAI 2023 Review Date: Fri, May 10, 2024 url: arXiv:2208.10687v2 Summary of paper Contribution We find that overestimating human rationality can have dire effects on reward learning accuracy and regret. We also find that fitting the rationality coefficient to human data enables better reward learning, even when the human deviates significantly from the noisy-rational choice model due to systematic biases. Some key terms What is Boltzmann Rationality coefficient $\beta$ ...
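For the key term above, a common formulation of the Boltzmann-rational (noisy-rational) choice model gives each option a probability proportional to $\exp(\beta \cdot \text{reward})$. The sketch below illustrates that formulation only; the function name and the example rewards are hypothetical and not taken from the paper.

```python
import math


def boltzmann_choice_probs(rewards, beta):
    """Probability of picking each option: proportional to exp(beta * reward)."""
    scaled = [beta * r for r in rewards]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


rewards = [1.0, 0.0]  # hypothetical rewards of two options
random_human = boltzmann_choice_probs(rewards, beta=0.0)   # ignores reward
noisy_human = boltzmann_choice_probs(rewards, beta=1.0)    # prefers option 0
near_optimal = boltzmann_choice_probs(rewards, beta=10.0)  # almost always option 0
```

A larger $\beta$ models a more rational human. Assuming a $\beta$ that is too large treats noisy choices as strong evidence about the reward, which is one way to read the overestimation result in the Contribution.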

May 10, 2024 · 2 min · 312 words · Sukai Huang

Nate Rahn Policy Optimization in Noisy Neighbourhood 2023

Title: Policy Optimization in Noisy Neighborhood Author: Nate Rahn et al. Publish Year: NeurIPS 2023 Review Date: Fri, May 10, 2024 url: https://arxiv.org/abs/2309.14597 Summary of paper Contribution In this paper, we demonstrate that high-frequency discontinuities in the mapping from policy parameters $\theta$ to return $R(\theta)$ are an important cause of return variation. As a consequence of these discontinuities, a single gradient step or perturbation to the policy parameters often causes important changes in the return, even in settings where both the policy and the dynamics are deterministic (unstable learning in some sense). Based on this observation, we demonstrate the usefulness of studying the landscape through the distribution of returns obtained from small perturbations of $\theta$. Some key terms Evidence that noisy reward signal leads to substantial variance in performance ...
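The idea of studying the distribution of returns under small perturbations of $\theta$ can be illustrated with a toy example. Everything here is hypothetical: `toy_return` is a made-up deterministic return function with a sharp cliff, standing in for a real policy rollout, and uniform noise is an assumed perturbation scheme.

```python
import random


def toy_return(theta):
    """Hypothetical deterministic return with a cliff at theta[0] = 1.0."""
    smooth = -sum((t - 0.9) ** 2 for t in theta)
    cliff = -5.0 if theta[0] > 1.0 else 0.0
    return smooth + cliff


def perturbed_returns(theta, return_fn, step=0.2, n_samples=200, seed=0):
    """Collect returns obtained from small random perturbations of theta."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        noisy = [t + rng.uniform(-step, step) for t in theta]
        samples.append(return_fn(noisy))
    return samples


theta = [0.9, 0.9]
returns = perturbed_returns(theta, toy_return)
worst, best = min(returns), max(returns)
```

Even though the return function is deterministic, the sampled returns split into two groups depending on which side of the cliff the perturbed parameters fall, so the return at `theta` alone says little about its neighborhood.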

May 10, 2024 · 3 min · 510 words · Sukai Huang

Ademi Adeniji Language Reward Modulation for Pretraining Rl 2023

Title: Language Reward Modulation for Pretraining Reinforcement Learning Author: Ademi Adeniji et al. Publish Year: ICLR 2023 reject Review Date: Thu, May 9, 2024 url: https://openreview.net/forum?id=SWRFC2EupO Summary of paper Motivation Learned reward functions (LRFs) are notorious for noise and reward misspecification errors, which can render them highly unreliable for learning robust policies with RL. Due to issues of reward exploitation and noisy models, these LRFs are ill-suited for directly learning downstream tasks. Generalization ability issue of multi-modal vision and language model (VLM) ...

May 9, 2024 · 2 min · 338 words · Sukai Huang

Thomas Coste Reward Model Ensembles Help Mitigate Overoptimization 2024

Title: Reward Model Ensembles Help Mitigate Overoptimization Author: Thomas Coste et al. Publish Year: 10 Mar 2024 Review Date: Thu, May 9, 2024 url: arXiv:2310.02743v2 Summary of paper Motivation However, as imperfect representations of the “true” reward, these learned reward models are susceptible to over-optimization. Contribution The authors conducted a systematic study to evaluate the efficacy of using ensemble-based conservative optimization objectives, specifically worst-case optimization (WCO) and uncertainty-weighted optimization (UWO), for mitigating reward model overoptimization. The authors additionally extend the setup to include 25% label noise to better mirror real-world conditions. For PPO, ensemble-based conservative optimization always reduces overoptimization and outperforms single reward model optimization. Some key terms Overoptimization ...
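The two conservative objectives named above can be sketched for a single response scored by an ensemble of reward models. This is a minimal sketch under assumptions: WCO is taken as the minimum over ensemble members, UWO as the ensemble mean minus a penalty `lam` times the ensemble variance, and the weight `lam` and all scores are hypothetical.

```python
from statistics import mean, pvariance


def worst_case_reward(ensemble_rewards):
    """WCO: use the most pessimistic ensemble member."""
    return min(ensemble_rewards)


def uncertainty_weighted_reward(ensemble_rewards, lam=0.5):
    """UWO (assumed form): mean reward minus lam times ensemble variance."""
    return mean(ensemble_rewards) - lam * pvariance(ensemble_rewards)


agree = [1.0, 1.1, 0.9]      # hypothetical scores, members agree
disagree = [3.0, 0.2, -0.2]  # same mean score, members disagree

wco_agree = worst_case_reward(agree)
wco_disagree = worst_case_reward(disagree)
uwo_agree = uncertainty_weighted_reward(agree)
uwo_disagree = uncertainty_weighted_reward(disagree)
```

Both responses have the same mean score, but both objectives rank the one the ensemble disagrees on lower, which discourages the policy from exploiting a single over-confident reward model.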

May 9, 2024 · 1 min · 205 words · Sukai Huang

Mengdi Li Internally Rewarded Rl 2023

Title: Internally Rewarded Reinforcement Learning Author: Mengdi Li et al. Publish Year: 2023 PMLR Review Date: Wed, May 8, 2024 url: https://proceedings.mlr.press/v202/li23ax.html Summary of paper Motivation The authors studied a class of RL problems where the reward signals for policy learning are generated by a discriminator that is dependent on and jointly optimized with the policy (parallel training on both the policy and the reward model). This leads to an unstable learning process because reward signals from an immature discriminator are noisy and impede policy learning, and conversely, an under-optimized policy impedes discriminator learning. We call this learning setting Internally Rewarded RL (IRRL) as the reward is not provided directly by the environment but internally by the discriminator. Contribution proposed the clipped linear reward function. Results show that the proposed reward function can consistently stabilize the training process by reducing the impact of reward noise, which leads to faster convergence and higher performance. we formulate a class of RL problems as IRRL, and formulate the inherent issues of noisy rewards that lead to an unstable training loop in IRRL we empirically characterize the noise in the discriminator and derive the effect of the reward function in reducing the bias of the estimated reward and the variance of reward noise from an underdeveloped discriminator Comment: the authors tried to express the bias and variance of reward noises in Taylor approximation propose clipped linear reward function Some key terms Simultaneous optimization causes suboptimal training ...
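To make the effect of a clipped linear reward concrete, the sketch below compares it with a logarithmic reward on a discriminator output. The exact forms are assumptions that should be checked against the paper: the logarithmic reward is taken as `log q - log prior` and the clipped linear reward as `max(q - prior, 0)`, where `q` is the discriminator's probability for the target label and `prior` is the label's prior probability. All numbers are hypothetical.

```python
import math


def log_reward(q, prior):
    """Logarithmic reward log(q) - log(prior) (assumed baseline form)."""
    return math.log(q) - math.log(prior)


def clipped_linear_reward(q, prior):
    """Clipped linear reward max(q - prior, 0) (assumed form)."""
    return max(q - prior, 0.0)


prior = 0.1  # hypothetical uniform prior over 10 labels

# An immature discriminator that is confidently wrong about the target label.
noisy_log = log_reward(0.001, prior)
noisy_clipped = clipped_linear_reward(0.001, prior)

# A mature discriminator that is confident and correct.
good_log = log_reward(0.9, prior)
good_clipped = clipped_linear_reward(0.9, prior)
```

Under these assumed forms, the logarithmic reward produces a large negative value when the discriminator is wrong, while the clipped linear reward stays bounded in [0, 1], which is consistent with the summary's claim that it reduces the impact of reward noise.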

May 8, 2024 · 4 min · 682 words · Sukai Huang

Xuran Pan on the Integration of Self Attention and Convolution 2022

Title: On the Integration of Self-Attention and Convolution Author: Xuran Pan et al. Publish Year: 2022 IEEE Review Date: Thu, Apr 25, 2024 url: https://arxiv.org/abs/2111.14556 Summary of paper Motivation there exists a strong underlying relation between convolution and self-attention. Related work Convolution NN: it uses convolution kernels to extract local features and has become the most powerful and conventional technique for various vision tasks Self-attention only: Recently, vision transformer shows that given enough data, we can treat an image as a sequence of 256 tokens and leverage Transformer models to achieve competitive results in image recognition. Attention enhanced convolution ...

April 25, 2024 · 1 min · 147 words · Sukai Huang

Recent Language Model Technique 2024

Title: Recent Language Model Technique 2024 Review Date: Thu, Apr 25, 2024 url: https://www.youtube.com/watch?v=kzB23CoZG30 url2: https://www.youtube.com/watch?v=iH-wmtxHunk url3: https://www.youtube.com/watch?v=o68RRGxAtDo Llama 3 key modification: grouped query attention (GQA) key instruction-tuning process: Their approach to post-training is a combination of supervised fine-tuning (SFT), rejection sampling, proximal policy optimization (PPO), and direct preference optimization (DPO). The quality of the prompts that are used in SFT and the preference rankings that are used in PPO and DPO has an outsized influence on the performance of aligned models. fine-tuning tool: torchtune ...
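The core idea of grouped query attention is that several query heads share one key/value head. The sketch below shows only that head-to-group mapping, not a full attention layer; the function name and the head counts are hypothetical and are not Llama 3's actual configuration.

```python
def kv_head_for_query_head(query_head, num_query_heads, num_kv_heads):
    """Index of the shared key/value head used by a given query head."""
    if num_query_heads % num_kv_heads != 0:
        raise ValueError("num_query_heads must be a multiple of num_kv_heads")
    group_size = num_query_heads // num_kv_heads
    return query_head // group_size


# Hypothetical configuration: 8 query heads sharing 2 key/value heads.
mapping = [kv_head_for_query_head(h, 8, 2) for h in range(8)]
```

With `num_kv_heads == num_query_heads` the mapping reduces to standard multi-head attention, and with `num_kv_heads == 1` to multi-query attention; GQA sits in between and shrinks the key/value cache by the group size.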

April 25, 2024 · 2 min · 332 words · Sukai Huang

Thomas Carta Grounding Llms in Rl 2023

Title: Grounding Large Language Models in Interactive Environments with Online Reinforcement Learning Author: Thomas Carta et al. Publish Year: 6 Sep 2023 Review Date: Tue, Apr 23, 2024 url: arXiv:2302.02662v3 Summary of paper Summary The authors considered an agent using an LLM as a policy that is progressively updated as the agent interacts with the environment, leveraging online reinforcement learning to improve its performance to solve goals (under the RL paradigm environment (MDP)) ...

April 23, 2024 · 2 min · 242 words · Sukai Huang

Daniel Hierarchies of Reward Machines 2023

Title: Hierarchies of Reward Machines Author: Daniel Furelos-Blanco et al. Publish Year: 4 Jun 2023 Review Date: Fri, Apr 12, 2024 url: https://arxiv.org/abs/2205.15752 Summary of paper Motivation Finite state machines are a simple yet powerful formalism for abstractly representing temporal tasks in a structured manner. Contribution The work introduces Hierarchies of Reward Machines (HRMs) to enhance the abstraction power of existing models. Key contributions include: HRM Abstraction Power: HRMs allow for the creation of hierarchies of Reward Machines (RMs), enabling constituent RMs to call other RMs. It’s proven that any HRM can be converted into an equivalent flat HRM with identical behavior. However, the equivalent flat HRM can have significantly more states and edges, especially under specific conditions. ...

April 12, 2024 · 5 min · 965 words · Sukai Huang

Shanchuan Efficient N Robust Exploration Through Discriminative Ir 2023

Title: DEIR: Efficient and Robust Exploration through Discriminative-Model-Based Episodic Intrinsic Rewards Author: Shanchuan Wan et al. Publish Year: 18 May 2023 Review Date: Fri, Apr 12, 2024 url: https://arxiv.org/abs/2304.10770 Summary of paper Motivation Recent studies have shown the effectiveness of encouraging exploration with intrinsic rewards estimated from novelties in observations. However, there is a gap between the novelty of an observation and an exploration, as both the stochasticity in the environment and the agent’s behaviour may affect the observation. Contribution we propose DEIR, a novel method in which we theoretically derive an intrinsic reward with a conditional mutual information term that principally scales with the novelty contributed by agent explorations, and then implement the reward with a discriminative forward model. The goal is a novel intrinsic reward design that considers not only the observed novelty but also the effective contribution brought by the agent. Some key terms internal rewards ...

April 12, 2024 · 9 min · 1795 words · Sukai Huang

Discover Hierarchical Achieve in Rl via Cl 2023

Title: Discovering Hierarchical Achievements in Reinforcement Learning via Contrastive Learning Author: Seungyong Moon et al. Publish Year: 2 Nov 2023 Review Date: Tue, Apr 2, 2024 url: https://arxiv.org/abs/2307.03486 Summary of paper Contribution PPO agents demonstrate some ability to predict future achievements. Leveraging this observation, a novel contrastive learning method called achievement distillation is introduced, enhancing the agent’s predictive abilities. This approach excels at discovering hierarchical achievements. Some key terms Model based and explicit module in previous studies are not that good ...

April 2, 2024 · 5 min · 1047 words · Sukai Huang

Jia Li Structured Cot Prompting for Code Generation 2023

Title: Structured Chain of Thought Prompting for Code Generation 2023 Author: Jia Li et al. Publish Year: 7 Sep 2023 Review Date: Wed, Feb 28, 2024 url: https://arxiv.org/pdf/2305.06599.pdf Summary of paper Contribution The paper introduces Structured CoTs (SCoTs) and a novel prompting technique called SCoT prompting for improving code generation with Large Language Models (LLMs) like ChatGPT and Codex. Unlike the previous Chain-of-Thought (CoT) prompting, which focuses on natural language reasoning steps, SCoT prompting leverages the structural information inherent in source code. By incorporating program structures (sequence, branch, and loop structures) into intermediate reasoning steps (SCoTs), LLMs are guided to generate more structured and accurate code. Evaluation on three benchmarks demonstrates that SCoT prompting outperforms CoT prompting by up to 13.79% in Pass@1, is preferred by human developers in terms of program quality, and exhibits robustness to various examples, leading to substantial improvements in code generation performance. ...

February 28, 2024 · 2 min · 381 words · Sukai Huang

Stephanie Teaching Models to Express Their Uncertainty in Words 2022

Title: Teaching Models to Express Their Uncertainty in Words Author: Stephanie Lin et al. Publish Year: 13 Jun 2022 Review Date: Wed, Feb 28, 2024 url: https://arxiv.org/pdf/2205.14334.pdf Summary of paper Motivation The study demonstrates that a GPT-3 model can articulate uncertainty about its answers in natural language without relying on model logits. It generates both an answer and a confidence level (e.g., “90% confidence” or “high confidence”), which map to well-calibrated probabilities. The model maintains moderate calibration even under distribution shift and shows sensitivity to uncertainty in its answers rather than mimicking human examples. ...

February 28, 2024 · 2 min · 327 words · Sukai Huang

Gwenyth Estimating Confidence of Llm by Prompt Agreement 2023

Title: Strength in Numbers: Estimating Confidence of Large Language Models by Prompt Agreement Author: Gwenyth Portillo Wightman et al. Publish Year: TrustNLP 2023 Review Date: Tue, Feb 27, 2024 url: https://aclanthology.org/2023.trustnlp-1.28.pdf Summary of paper Motivation While traditional classifiers produce scores for each label, language models instead produce scores for the generation, which may not be well calibrated. The authors proposed a method that involves comparing generated outputs across diverse prompts to create a confidence score. By utilizing multiple prompts, they aim to obtain more precise confidence estimates, using response diversity as a measure of confidence. Contribution The results show that this method produces more calibrated confidence estimates compared to the log probability of the answer to a single prompt, which could be valuable for users relying on prediction confidence in larger systems or decision-making processes. In one sentence: try multiple times and aggregate; the aggregate is more robust and consistent. Some key terms calibrated confidence score ...
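One simple reading of "prompt agreement" is to ask the same question with several paraphrased prompts and use the share of prompts that agree with the majority answer as the confidence. The sketch below implements that reading only; the paper may aggregate differently, and the function name and example answers are hypothetical.

```python
from collections import Counter


def agreement_confidence(answers):
    """Return the majority answer and the fraction of prompts that gave it."""
    if not answers:
        raise ValueError("need at least one answer")
    counts = Counter(answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)


# Hypothetical outputs of one model for five paraphrased prompts.
answer, confidence = agreement_confidence(
    ["Paris", "Paris", "Lyon", "Paris", "Paris"]
)
```

Unlike the log probability of a single generation, this score depends only on the final answers, so it can be computed even when token-level probabilities are unavailable.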

February 27, 2024 · 2 min · 393 words · Sukai Huang

Sudhir Agarwal Translate Infer Compile for Accurate Text to Plan 2024

Title: TIC: Translate-Infer-Compile for accurate “text to plan” using LLMs and logical intermediate representations Author: Sudhir Agarwal et al. Publish Year: Jan 2024 Review Date: Sat, Feb 17, 2024 url: https://arxiv.org/pdf/2402.06608.pdf Summary of paper Motivation Using an LLM to generate the task PDDL from natural language planning task descriptions is challenging. One of the primary reasons for failure is that the LLM often makes errors generating information that must abide by the constraints specified in the domain knowledge or the task descriptions ...

February 17, 2024 · 3 min · 639 words · Sukai Huang

Philip Cohen Intention Is Choice With Commitment 1990

Title: Intention Is Choice With Commitment Author: Philip Cohen et al. Publish Year: 1990 Review Date: Tue, Jan 30, 2024 url: https://www.sciencedirect.com/science/article/pii/0004370290900555 Summary of paper Contribution This paper delves into the principles governing the rational balance between an agent’s beliefs, goals, actions, and intentions, offering valuable insights for both artificial agents and a theory of human action. It focuses on clarifying when an agent can abandon their goals and how strongly they are committed to these goals. The formalism used in the paper captures several crucial aspects of intentions, including an analysis of Bratman’s three characteristic functional roles of intentions and how agents can avoid intending all the unintended consequences of their actions. Furthermore, the paper discusses how intentions can be shaped based on an agent’s relevant beliefs and other intentions or goals. It also introduces a preliminary concept of interpersonal commitments by relating one agent’s intentions to their beliefs about another agent’s intentions or beliefs. ...

January 30, 2024 · 4 min · 752 words · Sukai Huang

Christian Muise Planning for Goal Oriented Dialogue Systems 2019

Title: Christian Muise Planning for Goal Oriented Dialogue Systems 2019 Author: Publish Year: Review Date: Tue, Jan 30, 2024 url: arXiv:1910.08137v1 Summary of paper Motivation there is increasing demand for dialogue agents capable of handling specific tasks and interactions in a business context Contribution the authors propose a new approach that eliminates the need for manual specification of dialogue trees, a common practice in existing systems. They suggest using a declarative representation of the dialogue agent, which can be processed by advanced planning techniques (tree -> planning). The paper introduces a paradigm shift in specifying complex dialogue agents by recognizing that many aspects of these agents share similarities or identical underlying processes. Instead of manually creating and maintaining entire dialogue graphs, the authors propose a declarative approach where behavior is specified compactly, and the complete implicit graphs are generated from this specification. Some key terms limitation of end to end trained machine learning architectures ...

January 30, 2024 · 2 min · 416 words · Sukai Huang

Vishal Pallagani Llm N Planning Survey 2024

Title: “On the Prospects of Incorporating Large Language Models (LLMs) in Automated Planning and Scheduling (APS).” Author: Pallagani, Vishal, et al. Publish Year: arXiv preprint arXiv:2401.02500 (2024). Review Date: Mon, Jan 29, 2024 url: Summary of paper Contribution The paper provides a comprehensive review of 126 papers focusing on the integration of Large Language Models (LLMs) within Automated Planning and Scheduling, a growing area in Artificial Intelligence (AI). It identifies eight categories where LLMs are applied in addressing various aspects of planning problems: ...

January 29, 2024 · 3 min · 546 words · Sukai Huang