Jacob_andreas Guiding Pretraining in Reinforcement Learning With Llms 2023

[TOC] Title: Guiding Pretraining in Reinforcement Learning With Large Language Models Author: Yuqing Du, Jacob Andreas et al. Publish Year: 13 Feb 2023 Review Date: Wed, Apr 5, 2023 url: https://arxiv.org/pdf/2302.06692.pdf Summary of paper Motivation intrinsically motivated exploration methods address the sparse reward problem by rewarding agents for visiting novel states or transitions. Contribution we describe a method that uses background knowledge from text corpora to shape exploration. This method, called ELLM (Exploring with LLMs), rewards an agent for achieving goals suggested by a language model prompted with a description of the agent’s current state....
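
The reward computation is simple enough to sketch. Below is a minimal, self-contained rendering of the idea (not the authors' code): an LLM proposes goals given a state description, and the intrinsic reward is the similarity between a caption of the achieved transition and the closest suggested goal. `suggest_goals` and the string-overlap similarity are stand-ins; the paper prompts an actual LLM and uses embedding cosine similarity.

```python
from difflib import SequenceMatcher

def suggest_goals(state_description: str) -> list[str]:
    # Stand-in for prompting an LLM with the agent's current state.
    return ["chop the tree", "drink water", "attack the zombie"]

def similarity(a: str, b: str) -> float:
    # The paper uses embedding cosine similarity; cheap string overlap here.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def intrinsic_reward(state_description: str, transition_caption: str,
                     threshold: float = 0.9) -> float:
    # Reward the agent when an achieved transition matches a suggested goal.
    goals = suggest_goals(state_description)
    best = max(similarity(transition_caption, g) for g in goals)
    return best if best > threshold else 0.0

print(intrinsic_reward("You see a tree.", "chop the tree"))  # 1.0
```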

April 5, 2023 · 2 min · 298 words · Sukai Huang

Luke_zettlemoyer Scaling Expert Language Models With Unsupervised Domain Discovery 2023

[TOC] Title: Scaling Expert Language Models With Unsupervised Domain Discovery Author: Luke Zettlemoyer et al. Publish Year: 24 Mar 2023 Review Date: Mon, Apr 3, 2023 url: https://arxiv.org/pdf/2303.14177.pdf Summary of paper Contribution we introduce a simple but efficient method to asynchronously train large, sparse language models on arbitrary text corpora. Our method clusters a corpus into sets of related documents, trains a separate expert language model on each cluster, and combines them in a sparse ensemble for inference....
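
The cluster-then-ensemble recipe can be sketched in a few lines. This is my paraphrase, not the released code: documents are clustered (here with tf-idf + k-means), one expert LM is trained per cluster (stubbed below), and at inference the experts' next-token distributions are mixed with weights derived from the query's distance to each cluster centroid.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["the court ruled on the appeal", "the patient received a dose",
          "the theorem follows by induction", "the jury heard testimony"]
vec = TfidfVectorizer().fit(corpus)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(vec.transform(corpus))

def expert_next_token_probs(cluster_id: int, context: str) -> np.ndarray:
    # Placeholder: in the real method, one expert LM is trained per cluster.
    rng = np.random.default_rng(cluster_id)
    p = rng.random(8)
    return p / p.sum()

def ensemble_probs(context: str, temperature: float = 0.1) -> np.ndarray:
    d = km.transform(vec.transform([context]))[0]   # distance to each centroid
    w = np.exp(-d / temperature); w /= w.sum()      # sharper = sparser routing
    return sum(w[c] * expert_next_token_probs(c, context) for c in range(len(w)))

print(ensemble_probs("the judge ruled").round(3))
```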

April 3, 2023 · 1 min · 161 words · Sukai Huang

Xuanting_chen How Robust Is GPT 3.5 to Predecessors a Comprehensive Study on Language Understanding Tasks

[TOC] Title: How Robust Is GPT-3.5 to Predecessors? A Comprehensive Study on Language Understanding Tasks Author: Xuanting Chen et al. Publish Year: 2023 Review Date: Mon, Apr 3, 2023 url: https://arxiv.org/ftp/arxiv/papers/2303/2303.00293.pdf Summary of paper Motivation the robustness of GPT-3.5 and its ability to handle the various complexities of the open world have yet to be explored, which is especially crucial in assessing the stability of models and is a key aspect of trustworthy AI Contribution Our study yielded the following findings by comparing GPT-3....

April 3, 2023 · 2 min · 409 words · Sukai Huang

Anthony_liu a Picture Is Worth a Thousand Words Language Models Plan From Pixels 2023

[TOC] Title: A Picture Is Worth a Thousand Words: Language Models Plan From Pixels Author: Anthony Liu et al. Publish Year: 16 Mar 2023 Review Date: Mon, Apr 3, 2023 url: https://arxiv.org/pdf/2303.09031v1.pdf Summary of paper Motivation planning is an important capability of AI agents that perform long-horizon tasks in real-world environments. prior PLM-based approaches for planning either assume observations are available in the form of text, reason about plans from the instruction alone, or incorporate information about the visual environment in limited ways....

April 3, 2023 · 2 min · 359 words · Sukai Huang

Wenlong_huang Grounded Decoding Guiding Text Generation With Grounded Models for Robot Control 2023

[TOC] Title: Grounded Decoding: Guiding Text Generation With Grounded Models for Robot Control Author: Wenlong Huang et al. Publish Year: 1 Mar 2023 Review Date: Thu, Mar 30, 2023 url: https://arxiv.org/abs/2303.00855 Summary of paper Motivation Unfortunately, applying LLMs to settings with embodied agents, such as robots, is challenging due to their lack of experience with the physical world, inability to parse non-language observations, and ignorance of rewards or safety constraints that robots may require....
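
The decoding rule is easy to render as a sketch. In my simplified reading (with placeholder distributions), each step scores tokens by the product of the LLM's probability and a grounded model's probability, so the chosen token is both linguistically likely and feasible for the robot.

```python
import numpy as np

vocab = ["pick", "place", "fly", "cup"]

def llm_probs(prefix: str) -> np.ndarray:
    return np.array([0.4, 0.2, 0.3, 0.1])    # placeholder LM distribution

def grounded_probs(prefix: str) -> np.ndarray:
    # e.g. a robot affordance/safety model: "fly" is infeasible in this scene
    return np.array([0.9, 0.8, 0.01, 0.7])

def decode_step(prefix: str) -> str:
    joint = llm_probs(prefix) * grounded_probs(prefix)
    return vocab[int(np.argmax(joint / joint.sum()))]

print(decode_step("robot:"))  # "pick": likely under the LM and feasible
```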

March 30, 2023 · 2 min · 229 words · Sukai Huang

Mariana_vargas_vieyra Learning Generative Models With Goal Conditioned Reinforcement Learning 2023

[TOC] Title: Learning Generative Models With Goal Conditioned Reinforcement Learning Author: Mariana Vargas Vieyra et al. Publish Year: 26 Mar 2023 Review Date: Thu, Mar 30, 2023 url: https://arxiv.org/abs/2303.14811 Summary of paper Contribution we present a novel framework for learning generative models with goal-conditioned reinforcement learning. we define two agents, a goal-conditioned agent (GC-agent) and a supervised agent (S-agent). Given a user-input initial state, the GC-agent learns to reconstruct the training set....

March 30, 2023 · 2 min · 325 words · Sukai Huang

Itsugun_cho Deep Rl With Hierarchical Action Exploration for Dialogue Generation 2023

[TOC] Title: Deep RL With Hierarchical Action Exploration for Dialogue Generation Author: Itsugun Cho et al. Publish Year: 22 Mar 2023 Review Date: Thu, Mar 30, 2023 url: https://arxiv.org/pdf/2303.13465v1.pdf Summary of paper Motivation Approximate dynamic programming applied to dialogue generation involves policy improvement with action sampling. However, such a practice is inefficient for reinforcement learning because the eligible (high action value) responses are very sparse, and the greedy policy sustained by random sampling is flabby....

March 30, 2023 · 2 min · 358 words · Sukai Huang

Theodore_r_sumers How to Talk So Ai Will Learn 2022

[TOC] Title: How to talk so AI will learn: Instructions, descriptions, and autonomy Author: Theodore R. Sumers et al. Publish Year: NeurIPS 2022 Review Date: Wed, Mar 15, 2023 url: https://arxiv.org/pdf/2206.07870.pdf Summary of paper Motivation yet today, we lack computational models explaining such language use Contribution To address this challenge, we formalise learning from language in a contextual bandit setting and ask how a human might communicate preferences over behaviours (i.e., infer the intent or preference behind the presented behaviour). we show that instructions are better in low-autonomy settings, but descriptions are better when the agent will need to act independently....
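
A toy contextual-bandit rendering of the instruction/description contrast (my own construction, not the authors' model): an instruction names the best arm in the current context only, while a description conveys noisy reward weights that let the agent re-derive good behaviour in contexts it has never seen.

```python
import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([1.0, -1.0, 0.5])        # hidden reward weights
old_arms = rng.normal(size=(5, 3))         # arm features in the old context
new_arms = rng.normal(size=(5, 3))         # arm features in a new context

# Instruction: "take arm k" pins down the old context but does not generalize.
instructed_arm = int(np.argmax(old_arms @ true_w))

# Description: "feature 0 is good, feature 1 is bad" = noisy weight estimate.
described_w = true_w + rng.normal(scale=0.3, size=3)

best_new = int(np.argmax(new_arms @ true_w))
print("instruction still optimal?", instructed_arm == best_new)  # often False
print("description still optimal?",
      int(np.argmax(new_arms @ described_w)) == best_new)        # usually True
```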

March 15, 2023 · 3 min · 591 words · Sukai Huang

Cheng_chi Diffusion Policy Visuomotor Policy Learning via Action Diffusion 2023

[TOC] Title: Diffusion Policy: Visuomotor Policy Learning via Action Diffusion Author: Cheng Chi et al. Publish Year: 2023 Review Date: Thu, Mar 9, 2023 url: https://diffusion-policy.cs.columbia.edu/diffusion_policy_2023.pdf Summary of paper Contribution introducing a new form of robot visuomotor policy that generates behaviour via a “conditional denoising diffusion process” on the robot action space Some key terms Explicit policy learning: this is like imitation learning. Implicit policy learning: aims to minimise the estimate of an energy function; this is like standard reinforcement learning. diffusion policy...
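
For intuition, here is a bare-bones DDPM-style sampling loop over the action space, conditioned on an observation. The noise-prediction network `eps_theta` is a placeholder, not the paper's architecture; the update rule is the standard denoising step.

```python
import numpy as np

T = 50
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def eps_theta(a_t, obs, t):
    # Placeholder for the trained, observation-conditioned noise predictor.
    return a_t * 0.1

def sample_action(obs, dim=2, rng=np.random.default_rng(0)):
    a = rng.normal(size=dim)                 # start from pure Gaussian noise
    for t in reversed(range(T)):             # iteratively denoise
        eps = eps_theta(a, obs, t)
        mean = (a - betas[t] / np.sqrt(1 - alpha_bar[t]) * eps) / np.sqrt(alphas[t])
        noise = rng.normal(size=dim) if t > 0 else 0.0
        a = mean + np.sqrt(betas[t]) * noise
    return a

print(sample_action(obs=None))
```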

March 9, 2023 · 1 min · 205 words · Sukai Huang

Alan_lindsay Framer Planning Models From Natural Language Action Descriptions 2017

[TOC] Title: Framer: Planning Models From Natural Language Action Descriptions Author: Alan Lindsay et al. Publish Year: 2017 Review Date: Thu, Mar 9, 2023 url: https://core.ac.uk/download/pdf/322329049.pdf Summary of paper Motivation for model-assisting and model-generation tools, there is an underlying assumption that the user can formulate the problem using some formal language. this motivates us to generate planning domain models directly from NL descriptions. Some key terms approach we start from NL descriptions of actions and use NL analysis to construct a structured representation, from which we construct formal representations of action sequences?...

March 9, 2023 · 3 min · 482 words · Sukai Huang

Siddharth_karamcheti Language Driven Representation Learning for Robotics 2023

[TOC] Title: Language-Driven Representation Learning for Robotics Author: Siddharth Karamcheti et al. Publish Year: 24 Feb 2023 Review Date: Fri, Mar 3, 2023 url: https://arxiv.org/pdf/2302.12766.pdf Summary of paper Motivation recent work in visual representation learning for robotics demonstrates the viability of learning from large video datasets of humans performing everyday tasks. leveraging methods such as masked autoencoding and contrastive learning, these representations exhibit strong transfer to policy learning for visuomotor control. but robot learning encompasses a diverse set of problems beyond control, including grasp affordance prediction, language-conditioned imitation learning, and intent scoring for human-robot collaboration, amongst others....

March 3, 2023 · 3 min · 463 words · Sukai Huang

Tatsuki_kuribayashi Does Vision Accelerate Hierarchical Generalisation of Neural Language Learners 2023

[TOC] Title: Does Vision Accelerate Hierarchical Generalisation of Neural Language Learners Author: Tatsuki Kuribayashi Publish Year: 1 Feb 2023 Review Date: Fri, Mar 3, 2023 url: https://arxiv.org/pdf/2302.00667.pdf Summary of paper Motivation we want to know whether visual information improves the hierarchical generalisation of language models Contribution our results show that vision accelerated proper linguistic generalisation in the simplified, artificial setting, but LMs struggled with proper generalisation in the noisy, realistic setting....

March 3, 2023 · 1 min · 111 words · Sukai Huang

Jing_cheng_pang Natural Language Conditioned Reinforcement Learning With Inside Out Task Language Development and Translation 2023

[TOC] Title: Natural Language Conditioned Reinforcement Learning With Inside-Out Task Language Development and Translation Author: Jing-Cheng Pang et al. Publish Year: 18 Feb 2023 Review Date: Fri, Mar 3, 2023 url: https://arxiv.org/pdf/2302.09368.pdf Summary of paper Motivation previous approaches generally implemented language-conditioned RL by providing human instructions in natural language and training a following policy. this is an outside-in approach: the policy needs to comprehend the NL and manage the task simultaneously....

March 3, 2023 · 1 min · 173 words · Sukai Huang

Suvaansh_bhambri Multi Level Compositional Reasoning for Interactive Instruction Following 2023

[TOC] Title: Multi-Level Compositional Reasoning for Interactive Instruction Following Author: Suvaansh Bhambri et al. Publish Year: 2023 Review Date: Fri, Mar 3, 2023 url: https://ppolon.github.io/paper/aaai2023-alfred-mocha.pdf Summary of paper Motivation The tasks given to the agents are often composite and thus challenging, as completing them requires reasoning about multiple subtasks. Contribution we propose to divide and conquer by breaking the task into multiple subgoals and attending to them individually for better navigation and interaction....

March 3, 2023 · 1 min · 144 words · Sukai Huang

Tianjun_zhang the Wisdom of Hindsight Makes Language Models Better Instruction Followers 2023

[TOC] Title: The Wisdom of Hindsight Makes Language Models Better Instruction Followers Author: Tianjun Zhang et al. Publish Year: 10 Feb 2023 Review Date: Thu, Mar 2, 2023 url: https://arxiv.org/pdf/2302.05206.pdf Summary of paper Motivation Reinforcement Learning with Human Feedback (RLHF) demonstrates impressive performance on the GPT series models. However, the underlying RL pipeline is complex, requiring additional training of reward and value networks. Contribution in this paper, we consider an alternative approach: converting feedback to instruction by relabeling the original one and training the model for better alignment in a supervised manner....
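
The relabeling loop is straightforward to sketch. Everything below is a stub (generation, relabeling, and the fine-tuning step), but it shows the shape of the idea: whatever the model actually produced becomes, in hindsight, the instruction it is trained on, so failures turn into correct supervised pairs.

```python
def generate(instruction: str, prompt: str) -> str:
    # Stand-in for sampling from the current language model.
    return "4"

def relabel(instruction: str, output: str) -> str:
    # Hindsight: describe the output as if it had been the goal all along,
    # e.g. via a scripted feedback/checker function.
    return f"produce the answer {output}"

def supervised_step(instruction: str, prompt: str, target: str) -> None:
    # Stand-in for one supervised fine-tuning step on the relabeled pair.
    pass

for prompt in ["2 + 3 = ?"]:
    out = generate("produce the answer 5", prompt)       # model got it wrong
    hindsight_instruction = relabel("produce the answer 5", out)
    supervised_step(hindsight_instruction, prompt, out)  # now a correct pair
```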

March 2, 2023 · 3 min · 427 words · Sukai Huang

Ying_shen Learning by Asking for Embodied Visual Navigation and Task Completion 2023

[TOC] Title: Learning by Asking for Embodied Visual Navigation and Task Completion Author: Ying Shen et al. Publish Year: 9 Feb 2023 Review Date: Thu, Mar 2, 2023 url: https://arxiv.org/pdf/2302.04865.pdf Summary of paper Motivation despite recent progress on related vision-language benchmarks, most prior work has focused on building agents that follow instructions rather than endowing agents with the ability to ask questions to actively resolve ambiguities arising naturally in embodied environments. Contribution we introduce an Embodied Learning-by-Asking (ELBA) model that learns when to ask and what to ask for vision-dialog navigation and task completion....

March 2, 2023 · 2 min · 411 words · Sukai Huang

Ernest_davis Benchmarks for Automated Commonsense Reasoning a Survey 2023

[TOC] Title: Benchmarks for Automated Commonsense Reasoning: A Survey Author: Ernest Davis Publish Year: 9 Feb 2023 Review Date: Thu, Mar 2, 2023 url: https://arxiv.org/pdf/2302.04752.pdf Summary of paper we mainly focus on the section where the author discusses features of commonsense reasoning in general. Terms clarify what we mean by common sense: what exactly is “commonsensical”? Claims about common sense that seem true to the author Commonsense knowledge is common. In talking to another person, we do not have to explain commonsense reasoning or enumerate commonsense facts....

March 2, 2023 · 3 min · 573 words · Sukai Huang

Alexander_nikulin Anti Exploration by Random Network Distillation 2023

[TOC] Title: Anti-Exploration by Random Network Distillation Author: Alexander Nikulin et al. Publish Year: 31 Jan 2023 Review Date: Wed, Mar 1, 2023 url: https://arxiv.org/pdf/2301.13616.pdf Summary of paper Motivation despite the success of Random Network Distillation (RND) in various domains, it was shown to be insufficiently discriminative to be used as an uncertainty estimator for penalizing out-of-distribution actions in offline reinforcement learning ?? wait, why do we want to penalize out-of-distribution actions?...
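
(On the reviewer's question: in offline RL the agent can never try out-of-distribution actions, so their values are easily overestimated; penalizing them keeps the policy near the dataset.) Here is a minimal sketch of an RND-style penalty with toy linear models, not the paper's architecture: a predictor is trained to match a fixed random target on dataset state-actions, so its error is low in-distribution and high for OOD inputs, and that error is subtracted from the reward.

```python
import numpy as np

rng = np.random.default_rng(0)
W_target = rng.normal(size=(4, 8))   # fixed random target network
W_pred = np.zeros((4, 8))            # trainable predictor, different model class

def target_out(sa): return np.tanh(sa) @ W_target
def pred_out(sa):   return sa @ W_pred   # mismatched on purpose: can't fit globally

def penalty(sa):
    # Anti-exploration bonus: prediction error of the distilled network.
    return float(np.sum((pred_out(sa) - target_out(sa)) ** 2))

# Train the predictor on in-dataset state-action pairs only.
data = rng.normal(size=(256, 4))
for _ in range(500):
    grad = 2 * data.T @ (pred_out(data) - target_out(data)) / len(data)
    W_pred -= 0.05 * grad

ood = rng.normal(size=4) * 5.0
print(penalty(data[0]) < penalty(ood))  # True: OOD actions get a larger penalty
```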

March 1, 2023 · 2 min · 359 words · Sukai Huang

Edoardo_cetin Learning Pessimism for Reinforcement Learning 2023

[TOC] Title: Learning Pessimism for Reinforcement Learning Author: Edoardo Cetin et al. Publish Year: 2023 Review Date: Wed, Mar 1, 2023 url: https://kclpure.kcl.ac.uk/portal/files/196848783/10977.CetinE.pdf Summary of paper Motivation Off-policy deep RL algorithms commonly compensate for overestimation bias during temporal-difference learning by utilizing pessimistic estimates of the expected target returns Contribution we propose Generalised Pessimism Learning (GPL), a strategy employing a novel learnable penalty to enact such pessimism. In particular, we propose to learn this penalty alongside the critic with dual TD-learning, a new procedure to estimate and minimise the magnitude of the target returns bias with trivial computational cost....

March 1, 2023 · 2 min · 222 words · Sukai Huang

Timo_schick Toolformer Language Models Can Teach Themselves to Use Tools 2023

[TOC] Title: Toolformer: Language Models Can Teach Themselves to Use Tools Author: Timo Schick et al. (Meta AI Research) Publish Year: 9 Feb 2023 Review Date: Wed, Mar 1, 2023 url: https://arxiv.org/pdf/2302.04761.pdf Summary of paper Motivation LMs exhibit remarkable abilities to solve new tasks from just a few examples or textual instructions, especially at scale. Yet they also struggle with basic functionality, such as arithmetic or factual lookup. Contribution In this paper, we show that LMs can teach themselves to use external tools via simple APIs and achieve the best of both worlds....
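
The heart of the method is a self-supervised filter that fits in a few lines. Sketch below with the LM loss stubbed: a sampled API call is kept only if conditioning on the call *and* its result lowers the loss on the following tokens by at least a margin tau, relative to both no call and the call without its result.

```python
def lm_loss(prefix: str, continuation: str) -> float:
    # Stand-in for the LM's negative log-likelihood of `continuation`.
    return 1.0 if "2821" in prefix else 2.0

def keep_call(text_before: str, call: str, result: str, text_after: str,
              tau: float = 0.5) -> bool:
    with_result = lm_loss(text_before + f"[{call} -> {result}] ", text_after)
    without_result = lm_loss(text_before + f"[{call}] ", text_after)
    no_call = lm_loss(text_before, text_after)
    # Keep the call only if the result genuinely helps predict what follows.
    return with_result <= min(without_result, no_call) - tau

print(keep_call("403 * 7 = ", "Calculator(403 * 7)", "2821", "2821"))  # True
```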

March 1, 2023 · 3 min · 486 words · Sukai Huang

Almog_gueta Knowledge Is a Region in Weight Space for Fine Tuned Language Model 2023

[TOC] Title: Knowledge Is a Region in Weight Space for Fine-Tuned Language Models Author: Almog Gueta et al. Publish Year: 12 Feb 2023 Review Date: Wed, Mar 1, 2023 url: https://arxiv.org/pdf/2302.04863.pdf Summary of paper Motivation relatively little is known about the relationships between different models, especially those trained or tested on different datasets. Contribution we demonstrate that fine-tuned models that were optimized for high performance reside in well-defined regions in weight space, and vice versa: language models that have been fine-tuned on the same dataset form a tight cluster in weight space, while models fine-tuned on different datasets from the same underlying task form a looser cluster....
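
One practical consequence the paper points toward is that points between checkpoints inside such a region (e.g. their interpolation) also tend to perform well. A minimal PyTorch interpolation utility; the two `Linear` modules stand in for fine-tuned checkpoints of the same architecture.

```python
from collections import OrderedDict
import torch

def interpolate_state_dicts(sd_a, sd_b, alpha: float = 0.5):
    """Linearly interpolate two compatible state dicts: (1-a)*A + a*B."""
    return OrderedDict((k, (1 - alpha) * sd_a[k] + alpha * sd_b[k]) for k in sd_a)

a = torch.nn.Linear(4, 2)     # stand-in for fine-tuned checkpoint A
b = torch.nn.Linear(4, 2)     # stand-in for fine-tuned checkpoint B
mid = torch.nn.Linear(4, 2)   # the midpoint model, inside the cluster
mid.load_state_dict(interpolate_state_dicts(a.state_dict(), b.state_dict()))
```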

March 1, 2023 · 3 min · 548 words · Sukai Huang

Xiwen_liang Contrastive Instruction Trajectory Learning for Vision Language Navigation 2022

[TOC] Title: Contrastive Instruction-Trajectory Learning for Vision-Language Navigation Author: Xiwen Liang et al. Publish Year: AAAI 2022 Review Date: Fri, Feb 10, 2023 url: https://arxiv.org/abs/2112.04138 Summary of paper Motivation previous works learn to navigate step-by-step following an instruction. However, these works may fail to discriminate the similarities and discrepancies across instruction-trajectory pairs and ignore the temporal continuity of sub-instructions. These problems hinder agents from learning distinctive vision-and-language representations. Contribution we propose a coarse-grained contrastive learning objective to enhance vision-and-language representations by contrasting semantics of full trajectory observations and instructions, respectively; and a fine-grained contrastive learning objective to perceive instructions by leveraging the temporal information of the sub-instructions....
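
The coarse-grained objective is, in spirit, an InfoNCE loss over paired instruction and trajectory embeddings. A minimal sketch (encoders replaced by random features; the paper's actual loss design and negative sampling are more involved):

```python
import torch
import torch.nn.functional as F

def info_nce(instr_emb, traj_emb, temperature: float = 0.07):
    instr = F.normalize(instr_emb, dim=-1)
    traj = F.normalize(traj_emb, dim=-1)
    logits = instr @ traj.t() / temperature   # similarity of every pair
    labels = torch.arange(len(instr))         # i-th instruction matches i-th trajectory
    return F.cross_entropy(logits, labels)

instr_emb = torch.randn(8, 128)   # stand-in instruction encoder outputs
traj_emb = torch.randn(8, 128)    # stand-in trajectory encoder outputs
print(info_nce(instr_emb, traj_emb))
```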

February 10, 2023 · 2 min · 360 words · Sukai Huang

Jacob_andreas Lammp Language Models as Probabilistic Priors for Perception and Action 2023

[TOC] Title: LaMPP: Language Models as Probabilistic Priors for Perception and Action Author: Belinda Z. Li, Jacob Andreas et al. Publish Year: 3 Feb 2023 Review Date: Fri, Feb 10, 2023 url: https://arxiv.org/pdf/2302.02801.pdf Summary of paper Motivation Language models trained on large text corpora encode rich distributional information about real-world environments and action sequences. this information plays a crucial role. Contribution we describe how to leverage language models for non-linguistic perception and control tasks. Our approach casts labelling and decision-making as inference in probabilistic graphical models in which language models parameterize prior distributions over labels, decisions and parameters, making it possible to integrate uncertain observations and incomplete background knowledge in a principled way....
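
The probabilistic reading fits in a few lines. A minimal sketch with made-up numbers: the LM supplies a prior over label hypotheses, a noisy perception model supplies the likelihood, and Bayes' rule combines them, letting background knowledge override an ambiguous observation.

```python
import numpy as np

labels = ["towel in bathroom", "towel in kitchen", "towel in garage"]
lm_prior = np.array([0.70, 0.25, 0.05])      # from querying the LM (illustrative)
obs_likelihood = np.array([0.30, 0.40, 0.30])  # from a noisy vision model

posterior = lm_prior * obs_likelihood        # Bayes' rule, up to normalization
posterior /= posterior.sum()
print(labels[int(np.argmax(posterior))], posterior.round(3))
# prior knowledge wins over the near-uniform observation
```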

February 10, 2023 · 2 min · 267 words · Sukai Huang

Zhuosheng_zhang Multimodal Chain of Thought Reasoning in Language Models 2023

[TOC] Title: Multimodal Chain-of-Thought Reasoning in Language Models Author: Zhuosheng Zhang et al. Publish Year: 2023 Review Date: Wed, Feb 8, 2023 url: https://arxiv.org/pdf/2302.00923.pdf Summary of paper Motivation LLMs have shown impressive performance on complex reasoning by leveraging chain-of-thought (CoT) prompting to generate intermediate reasoning chains as the rationale to infer the answer. to elicit CoT reasoning in multimodality, a possible solution is to fine-tune small language models by fusing the vision and language features to perform CoT reasoning....
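
The proposed framework is a two-stage pipeline, easy to show as a stub (both model calls are placeholders for fine-tuned vision-language models): stage one generates a rationale from the language input fused with vision features; stage two appends that rationale to the input and predicts the answer.

```python
def rationale_model(text: str, vision_features) -> str:
    # Stage 1: placeholder for a fused vision-language model's generation.
    return "The fruit in the image is round and orange."

def answer_model(text: str, vision_features) -> str:
    # Stage 2: placeholder; conditions on the rationale-augmented input.
    return "an orange"

def multimodal_cot(question: str, vision_features) -> str:
    rationale = rationale_model(question, vision_features)
    return answer_model(question + " Rationale: " + rationale, vision_features)

print(multimodal_cot("What fruit is shown?", vision_features=None))
```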

February 8, 2023 · 3 min · 548 words · Sukai Huang

Siyuan_wang Unifying Structure Reasoning and Language Model Pre Training for Complex Reasoning 2023

[TOC] Title: Unifying Structure Reasoning and Language Model Pre-Training for Complex Reasoning Author: Siyuan Wang et al. Publish Year: 21 Jan 2023 Review Date: Wed, Feb 8, 2023 url: https://arxiv.org/pdf/2301.08913.pdf Summary of paper Motivation language models still suffer from a heterogeneous information alignment problem and a noisy knowledge injection problem. for complex reasoning, the context contains rich knowledge that typically exists in complex and sparse form. Contribution we propose to unify structure reasoning and language model pre-training: identify four types of elementary knowledge structures from contexts to construct structured queries, and utilise the box embedding method to conduct explicit structure reasoning along the query during language modeling Some key terms What is the problem...
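
To make "explicit structure reasoning along the query" concrete, here is a Query2Box-style sketch of box embeddings, which the description echoes (a simplification under my assumptions, not the paper's exact formulation): queries are axis-aligned boxes, conjunctions intersect boxes, and candidate answers score by their L1 distance to the box.

```python
import numpy as np

def box_distance(point, center, offset, alpha: float = 0.5):
    # Score of a candidate entity (point) against a query box (center +/- offset).
    lo, hi = center - offset, center + offset
    outside = np.maximum(point - hi, 0) + np.maximum(lo - point, 0)
    inside = center - np.minimum(hi, np.maximum(lo, point))
    return np.linalg.norm(outside, 1) + alpha * np.linalg.norm(inside, 1)

def intersect_boxes(c1, o1, c2, o2):
    # Conjunction of two structured queries = intersection of their boxes.
    lo = np.maximum(c1 - o1, c2 - o2)
    hi = np.minimum(c1 + o1, c2 + o2)
    return (lo + hi) / 2, np.maximum((hi - lo) / 2, 0)

c, o = intersect_boxes(np.zeros(2), np.ones(2), np.ones(2), np.ones(2))
print(box_distance(np.array([0.5, 0.5]), c, o))  # 0.0: inside the intersection
```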

February 8, 2023 · 2 min · 281 words · Sukai Huang