Damai Dai Deepseekmoe 2024

[TOC] Title: DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture of Experts Language Models Author: Damai Dai et al. Publish Year: 11 Jan 2024 Review Date: Sat, Jun 22, 2024 url: https://arxiv.org/pdf/2401.06066 Summary of paper Motivation conventional MoE architectures like GShard, which activate the top-K out of N experts, face challenges in ensuring expert specialization, i.e., each expert acquiring non-overlapping and focused knowledge; in response, we propose the DeepSeekMoE architecture towards ultimate expert specialization Contribution segmenting each expert into m smaller ones (mN in total) and activating mK of them isolating K_s experts as shared ones, aiming at capturing common knowledge and mitigating redundancy in routed experts Some key terms MoE architecture...
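
A minimal sketch of the two ideas above (fine-grained routed experts plus always-active shared experts), assuming standard PyTorch; the sizes, names (`n_shared`, `n_routed`, `top_k`) and the dense expert evaluation are illustrative, not the paper's implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeepSeekMoESketch(nn.Module):
    """Fine-grained routed experts + shared experts (illustrative only)."""
    def __init__(self, d_model=64, d_ff=128, n_shared=2, n_routed=16, top_k=4):
        super().__init__()
        make_expert = lambda: nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.shared = nn.ModuleList(make_expert() for _ in range(n_shared))  # always active
        self.routed = nn.ModuleList(make_expert() for _ in range(n_routed))  # mN small experts
        self.gate = nn.Linear(d_model, n_routed, bias=False)
        self.top_k = top_k                                                   # activate mK of them

    def forward(self, x):                                    # x: (batch, d_model)
        shared_out = sum(e(x) for e in self.shared)
        scores = F.softmax(self.gate(x), dim=-1)             # (batch, n_routed)
        topv, topi = scores.topk(self.top_k, dim=-1)
        mask = torch.zeros_like(scores).scatter(-1, topi, topv)
        # dense evaluation of every routed expert for clarity (a real MoE dispatches sparsely)
        routed_out = torch.stack([e(x) for e in self.routed], dim=1)  # (batch, n_routed, d_model)
        return shared_out + (mask.unsqueeze(-1) * routed_out).sum(dim=1)

out = DeepSeekMoESketch()(torch.randn(8, 64))                # (8, 64)
```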

<span title='2024-06-22 11:13:50 +1000 AEST'>June 22, 2024</span>&nbsp;·&nbsp;3 min&nbsp;·&nbsp;582 words&nbsp;·&nbsp;Sukai Huang

Jessy Lin Learning to Model the World With Language 2024

[TOC] Title: Learning to Model the World With Language 2024 Author: Jessy Lin et al. Publish Year: ICML 2024 Review Date: Fri, Jun 21, 2024 url: https://arxiv.org/abs/2308.01399 Summary of paper Motivation in this work, we propose that agents can ground diverse kinds of language by using it to predict the future in contrast to directly predicting what to do with a language-conditioned policy, Dynalang decouples learning to model the world with language (supervised learning with prediction objectives) from learning to act given that model (RL with task rewards) Future prediction provides a rich grounding signal for learning what language utterances mean, which in turn equips the agent with a richer understanding of the world to solve complex tasks....

<span title='2024-06-21 11:47:25 +1000 AEST'>June 21, 2024</span>&nbsp;·&nbsp;2 min&nbsp;·&nbsp;381 words&nbsp;·&nbsp;Sukai Huang

Verification in Llm Topic 2024

[TOC] Review Date: Thu, Jun 20, 2024 Verification in LLM Topic 2024 Paper 1: Weng, Yixuan, et al. “Large language models are better reasoners with self-verification.” arXiv preprint arXiv:2212.09561 (2022). the better reasoning with CoT is carried out in the following two steps: Forward Reasoning and Backward Verification. Specifically, in Forward Reasoning, LLM reasoners generate candidate answers using CoT, and the question and candidate answers form different conclusions to be verified....
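
A rough sketch of that two-step loop; `generate` stands for any prompt-to-completion callable, and the verification step here is simplified to a yes/no re-check rather than the paper's condition-masking procedure:

```python
def self_verify(question, generate, n_candidates=5, n_checks=3):
    # Forward Reasoning: sample several chain-of-thought candidate answers.
    candidates = [
        generate(f"Q: {question}\nLet's think step by step, then give a final answer.")
        for _ in range(n_candidates)
    ]
    # Backward Verification: take each candidate as a conclusion and ask the model
    # to re-check it against the question; keep the candidate that passes most checks.
    def score(answer):
        verdicts = [
            generate(f"Question: {question}\nProposed answer: {answer}\n"
                     "Does this answer satisfy every condition in the question? Answer yes or no.")
            for _ in range(n_checks)
        ]
        return sum(v.strip().lower().startswith("yes") for v in verdicts)
    return max(candidates, key=score)
```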

<span title='2024-06-20 20:19:12 +1000 AEST'>June 20, 2024</span>&nbsp;·&nbsp;1 min&nbsp;·&nbsp;110 words&nbsp;·&nbsp;Sukai Huang

Jiuzhou Reward Engineering for Generating Semi Structured Explan 2023

[TOC] Title: Reward Engineering for Generating Semi-Structured Explanation Author: Jiuzhou Han et al. Publish Year: EACL 2024 Review Date: Thu, Jun 20, 2024 url: https://github.com/Jiuzhouh/Reward-Engineering-for-Generating-SEG Summary of paper Motivation Contribution the objective is to equip moderately-sized LMs with the ability to not only provide answers but also generate structured explanations Some key terms Intro the authors give some background: Cui et al. incorporate a generative pre-training mechanism over synthetic graphs by aligning input text-graph pairs to improve the model’s capability in generating semi-structured explanations....

<span title='2024-06-20 14:11:32 +1000 AEST'>June 20, 2024</span>&nbsp;·&nbsp;1 min&nbsp;·&nbsp;162 words&nbsp;·&nbsp;Sukai Huang

Jiuzhou Towards Uncertainty Aware Lang Agent 2024

[TOC] Title: Towards Uncertainty Aware Language Agent Author: Jiuzhou Han et al. Publish Year: 30 May 2024 Review Date: Thu, Jun 20, 2024 url: arXiv:2401.14016v3 Summary of paper Motivation The existing approaches neglect the notion of uncertainty during these interactions Contribution Some key terms Related work 1: lang agent the authors define what a language agent is and discuss it – the prominent work ReAct proposes a general language agent framework that combines reasoning and acting with LLMs for solving diverse language reasoning tasks....

<span title='2024-06-20 11:15:18 +1000 AEST'>June 20, 2024</span>&nbsp;·&nbsp;2 min&nbsp;·&nbsp;295 words&nbsp;·&nbsp;Sukai Huang

Silviu Pitis Failure Modes of Learning Reward Models for Sequence Model 2023

[TOC] Title: Failure Modes of Learning Reward Models for LLMs and other Sequence Models Author: Silviu Pitis Publish Year: ICML workshop 2023 Review Date: Fri, May 10, 2024 url: https://openreview.net/forum?id=NjOoxFRZA4&noteId=niZsZfTPPt Summary of paper C3. Preference cannot be represented as numbers M1. rationality level of human preference 3.2, if the condition/context changes, the preference may change rapidly, and this cannot be reflected in the reward model A2. Preference should be expressed with respect to state-policy pairs, rather than just outcomes A state-policy pair includes both the current state of the system and the strategy (policy) being employed....

<span title='2024-05-10 22:23:31 +1000 AEST'>May 10, 2024</span>&nbsp;·&nbsp;2 min&nbsp;·&nbsp;312 words&nbsp;·&nbsp;Sukai Huang

Gaurav Ghosal the Effect of Modeling Human Rationality Level 2023

[TOC] Title: The Effect of Modeling Human Rationality Level on Learning Rewards from Multiple Feedback Types Author: Gaurav R. Ghosal et al. Publish Year: 9 Mar 2023 AAAI 2023 Review Date: Fri, May 10, 2024 url: arXiv:2208.10687v2 Summary of paper Contribution We find that overestimating human rationality can have dire effects on reward learning accuracy and regret We also find that fitting the rationality coefficient to human data enables better reward learning, even when the human deviates significantly from the noisy-rational choice model due to systematic biases Some key terms What is the Boltzmann rationality coefficient $\beta$...
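
As a quick reminder of what $\beta$ controls in the noisy-rational (Boltzmann) choice model: $\beta = 0$ gives uniformly random choices, and $\beta \to \infty$ gives a perfectly reward-maximizing chooser. A small numeric sketch (the reward values are just for illustration):

```python
import numpy as np

def boltzmann_choice_probs(rewards, beta):
    """P(choice i) is proportional to exp(beta * r_i)."""
    logits = beta * np.asarray(rewards, dtype=float)
    logits -= logits.max()                 # numerical stability
    p = np.exp(logits)
    return p / p.sum()

rewards = [1.0, 2.0, 3.0]
print(boltzmann_choice_probs(rewards, beta=0.0))   # ~uniform: fully noisy feedback
print(boltzmann_choice_probs(rewards, beta=5.0))   # nearly deterministic: highly rational
```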

<span title='2024-05-10 19:35:03 +1000 AEST'>May 10, 2024</span>&nbsp;·&nbsp;2 min&nbsp;·&nbsp;312 words&nbsp;·&nbsp;Sukai Huang

Nate Rahn Policy Optimization in Noisy Neighbourhood 2023

[TOC] Title: Policy Optimization in Noisy Neighborhood Author: Nate Rahn et al. Publish Year: NeurIPS 2023 Review Date: Fri, May 10, 2024 url: https://arxiv.org/abs/2309.14597 Summary of paper Contribution in this paper, we demonstrate that high-frequency discontinuities in the mapping from policy parameters $\theta$ to return $R(\theta)$ are an important cause of return variation. As a consequence of these discontinuities, a single gradient step or perturbation to the policy parameters often causes significant changes in the return, even in settings where both the policy and the dynamics are deterministic....
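
A tiny sketch of how one might probe such a "noisy neighborhood": evaluate the return at small random perturbations of $\theta$ and look at the spread. `evaluate_return` is a placeholder for any deterministic params-to-return rollout (here assumed to take a NumPy array), not the paper's code:

```python
import numpy as np

def probe_neighborhood(theta, evaluate_return, sigma=0.01, n_samples=50, seed=0):
    rng = np.random.default_rng(seed)
    base = evaluate_return(theta)
    perturbed = [
        evaluate_return(theta + sigma * rng.standard_normal(theta.shape))
        for _ in range(n_samples)
    ]
    # a large spread despite tiny, deterministic parameter changes indicates
    # high-frequency discontinuities in the return landscape around theta
    return base, float(np.mean(perturbed)), float(np.std(perturbed))
```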

<span title='2024-05-10 14:16:56 +1000 AEST'>May 10, 2024</span>&nbsp;·&nbsp;3 min&nbsp;·&nbsp;510 words&nbsp;·&nbsp;Sukai Huang

Ademi Adeniji Language Reward Modulation for Pretraining Rl 2023

[TOC] Title: Language Reward Modulation for Pretraining Reinforcement Learning Author: Ademi Adeniji et al. Publish Year: ICLR 2023 reject Review Date: Thu, May 9, 2024 url: https://openreview.net/forum?id=SWRFC2EupO Summary of paper Motivation Learned reward functions (LRFs) are notorious for noise and reward misspecification errors, which can render them highly unreliable for learning robust policies with RL; due to reward exploitation and model noise, these LRFs are ill-suited for directly learning downstream tasks....

<span title='2024-05-09 21:18:00 +1000 AEST'>May 9, 2024</span>&nbsp;·&nbsp;2 min&nbsp;·&nbsp;338 words&nbsp;·&nbsp;Sukai Huang

Thomas Coste Reward Model Ensembles Help Mitigate Overoptimization 2024

[TOC] Title: Reward Model Ensembles Help Mitigate Overoptimization Author: Thomas Coste et al. Publish Year: 10 Mar 2024 Review Date: Thu, May 9, 2024 url: arXiv:2310.02743v2 Summary of paper Motivation however, as imperfect representations of the “true” reward, these learned reward models are susceptible to over-optimization. Contribution the authors conducted a systematic study to evaluate the efficacy of ensemble-based conservative optimization objectives, specifically worst-case optimization (WCO) and uncertainty-weighted optimization (UWO), for mitigating reward model overoptimization the authors additionally extend the setup to include 25% label noise to better mirror real-world conditions For PPO, ensemble-based conservative optimization always reduces overoptimization and outperforms single reward model optimization Some key terms Overoptimization...
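
A minimal sketch of the two conservative objectives as I understand them, computed over a batch of responses scored by an ensemble of reward models; the variance-penalty coefficient is illustrative:

```python
import torch

def conservative_rewards(ensemble_scores, uwo_coeff=1.0):
    """ensemble_scores: (n_models, batch) rewards for the same responses."""
    wco = ensemble_scores.min(dim=0).values                                      # worst-case optimization
    uwo = ensemble_scores.mean(dim=0) - uwo_coeff * ensemble_scores.var(dim=0)   # uncertainty-weighted
    return wco, uwo

scores = torch.randn(4, 8)          # e.g. 4 reward models scoring 8 responses
wco, uwo = conservative_rewards(scores)
```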

<span title='2024-05-09 14:06:33 +1000 AEST'>May 9, 2024</span>&nbsp;·&nbsp;1 min&nbsp;·&nbsp;205 words&nbsp;·&nbsp;Sukai Huang

Mengdi Li Internally Rewarded Rl 2023

[TOC] Title: Internally Rewarded Reinforcement Learning Author: Mengdi Li et al. Publish Year: 2023 PMLR Review Date: Wed, May 8, 2024 url: https://proceedings.mlr.press/v202/li23ax.html Summary of paper Motivation the authors study a class of RL problems where the reward signals for policy learning are generated by a discriminator that is dependent on and jointly optimized with the policy (parallel training of both the policy and the reward model) this leads to an unstable learning process because reward signals from an immature discriminator are noisy and impede policy learning, and conversely, an under-optimized policy impedes discriminator learning we call this learning setting Internally Rewarded RL (IRRL) as the reward is not provided directly by the environment but internally by the discriminator....
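
A toy, self-contained illustration of that coupled loop (my own stand-in, not the paper's setup or tasks): the discriminator tries to infer which goal the policy was pursuing from the resulting observation, its log-likelihood serves as the internal reward, and the two are updated in alternation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

n_goals, n_actions, obs_dim = 4, 6, 6
policy = nn.Linear(n_goals, n_actions)          # goal -> action logits
discriminator = nn.Linear(obs_dim, n_goals)     # observation -> inferred goal
opt_pi = torch.optim.Adam(policy.parameters(), lr=1e-2)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-2)

def env_step(actions):
    # stand-in environment: the observation is a noisy one-hot of the action taken
    return F.one_hot(actions, obs_dim).float() + 0.1 * torch.randn(len(actions), obs_dim)

for step in range(500):
    goals = torch.randint(0, n_goals, (64,))
    dist = torch.distributions.Categorical(logits=policy(F.one_hot(goals, n_goals).float()))
    actions = dist.sample()
    obs = env_step(actions)

    # discriminator update: supervised on data generated by the current policy
    d_loss = F.cross_entropy(discriminator(obs), goals)
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # policy update: REINFORCE on the internal reward (noisy while the discriminator is immature)
    with torch.no_grad():
        reward = F.log_softmax(discriminator(obs), dim=-1).gather(1, goals[:, None]).squeeze(1)
    pi_loss = -(reward * dist.log_prob(actions)).mean()
    opt_pi.zero_grad(); pi_loss.backward(); opt_pi.step()
```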

<span title='2024-05-08 14:59:15 +1000 AEST'>May 8, 2024</span>&nbsp;·&nbsp;4 min&nbsp;·&nbsp;682 words&nbsp;·&nbsp;Sukai Huang

Xuran Pan on the Integration of Self Attention and Convolution 2022

[TOC] Title: On the Integration of Self-Attention and Convolution Author: Xuran Pan et al. Publish Year: 2022 IEEE Review Date: Thu, Apr 25, 2024 url: https://arxiv.org/abs/2111.14556 Summary of paper Motivation there exists a strong underlying relation between convolution and self-attention. Related work Convolutional NNs use convolution kernels to extract local features and have become the most powerful and conventional technique for various vision tasks Self-attention only Recently, vision transformers showed that, given enough data, we can treat an image as a sequence of 256 tokens and leverage Transformer models to achieve competitive results in image recognition....

<span title='2024-04-25 17:53:46 +1000 AEST'>April 25, 2024</span>&nbsp;·&nbsp;1 min&nbsp;·&nbsp;147 words&nbsp;·&nbsp;Sukai Huang

Recent Language Model Technique 2024

[TOC] Title: Recent Language Model Technique 2024 Review Date: Thu, Apr 25, 2024 url: https://www.youtube.com/watch?v=kzB23CoZG30 url2: https://www.youtube.com/watch?v=iH-wmtxHunk url3: https://www.youtube.com/watch?v=o68RRGxAtDo Llama 3 key modification: grouped query attention (GQA) key instruction-tuning process: Their approach to post-training is a combination of supervised fine-tuning (SFT), rejection sampling, proximal policy optimization (PPO), and direct preference optimization (DPO). The quality of the prompts used in SFT and of the preference rankings used in PPO and DPO has an outsized influence on the performance of aligned models....
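
For reference, a minimal sketch of grouped-query attention: several query heads share each key/value head, which shrinks the KV cache. Shapes and names are illustrative, not Llama 3's implementation:

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(x, wq, wk, wv, n_heads, n_kv_heads):
    """x: (B, T, d_model); wq: (d_model, n_heads*d_head); wk, wv: (d_model, n_kv_heads*d_head)."""
    B, T, _ = x.shape
    d_head = wq.shape[1] // n_heads
    q = (x @ wq).view(B, T, n_heads, d_head).transpose(1, 2)      # (B, n_heads, T, d_head)
    k = (x @ wk).view(B, T, n_kv_heads, d_head).transpose(1, 2)   # (B, n_kv_heads, T, d_head)
    v = (x @ wv).view(B, T, n_kv_heads, d_head).transpose(1, 2)
    group = n_heads // n_kv_heads
    k = k.repeat_interleave(group, dim=1)   # each group of query heads reuses one K/V head
    v = v.repeat_interleave(group, dim=1)
    att = (q @ k.transpose(-2, -1)) / d_head ** 0.5
    causal = torch.triu(torch.ones(T, T), diagonal=1).bool()
    att = att.masked_fill(causal, float("-inf"))
    return (F.softmax(att, dim=-1) @ v).transpose(1, 2).reshape(B, T, -1)

x = torch.randn(2, 10, 64)
out = grouped_query_attention(x, torch.randn(64, 64), torch.randn(64, 16), torch.randn(64, 16),
                              n_heads=8, n_kv_heads=2)   # 8 query heads share 2 KV heads
```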

<span title='2024-04-25 12:49:03 +1000 AEST'>April 25, 2024</span>&nbsp;·&nbsp;2 min&nbsp;·&nbsp;332 words&nbsp;·&nbsp;Sukai Huang

Thomas Carta Grounding Llms in Rl 2023

[TOC] Title: Grounding Large Language Models in Interactive Environments with Online Reinforcement Learning Author: Thomas Carta et al. Publish Year: 6 Sep 2023 Review Date: Tue, Apr 23, 2024 url: arXiv:2302.02662v3 Summary of paper Summary The author considered an agent using an LLM as a policy that is progressively updated as the agent interacts with the environment, leveraging online reinforcement learning to improve its performance at solving goals (under the RL paradigm environment (MDP))....
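
A small sketch of the "LLM as policy" idea: score each admissible action by the log-probability the LLM assigns to its tokens given the goal/observation prompt, then normalize over actions (the online PPO update is omitted). The model name and prompt format are placeholders; only standard transformers calls are used:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")

def action_distribution(prompt, actions):
    scores = []
    for a in actions:
        ids = tok(prompt + " " + a, return_tensors="pt").input_ids
        n_prompt = tok(prompt, return_tensors="pt").input_ids.shape[1]
        with torch.no_grad():
            logp = F.log_softmax(lm(ids).logits[0, :-1], dim=-1)
        targets = ids[0, 1:]
        token_logps = logp[torch.arange(len(targets)), targets]
        scores.append(token_logps[n_prompt - 1:].sum())      # sum over the action's tokens only
    return torch.softmax(torch.stack(scores), dim=0)         # policy over admissible actions

probs = action_distribution(
    "Goal: pick up the red ball. Observation: a red ball is on the table. Next action:",
    ["pick up ball", "go forward", "turn left"])
```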

<span title='2024-04-23 13:20:22 +1000 AEST'>April 23, 2024</span>&nbsp;·&nbsp;2 min&nbsp;·&nbsp;242 words&nbsp;·&nbsp;Sukai Huang

Daniel Hierarchies of Reward Machines 2023

[TOC] Title: Hierarchies of Reward Machines Author: Daniel Furelos-Blanco et al. Publish Year: 4 Jun 2023 Review Date: Fri, Apr 12, 2024 url: https://arxiv.org/abs/2205.15752 Summary of paper Motivation Finite state machines are a simple yet powerful formalism for abstractly representing temporal tasks in a structured manner. Contribution The work introduces Hierarchies of Reward Machines (HRMs) to enhance the abstraction power of existing models. Key contributions include: HRM Abstraction Power: HRMs allow for the creation of hierarchies of reward machines (RMs), enabling constituent RMs to call other RMs....
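
A minimal sketch of a flat reward machine to make the formalism concrete; in an HRM, a transition could instead call another machine and resume once that sub-machine reaches its accepting node. The task and propositions below are made up:

```python
class RewardMachine:
    def __init__(self, transitions, initial="u0", accepting="u_acc"):
        # transitions: {(node, proposition): (next_node, reward)};
        # unmatched labels self-loop with zero reward
        self.transitions = transitions
        self.initial, self.accepting = initial, accepting

    def step(self, node, proposition):
        return self.transitions.get((node, proposition), (node, 0.0))

# toy "get coffee, then deliver it to the office" task over propositions from a labeling function
rm = RewardMachine({
    ("u0", "at_coffee"): ("u1", 0.0),
    ("u1", "at_office"): ("u_acc", 1.0),
})
node, total = rm.initial, 0.0
for prop in ["none", "at_coffee", "none", "at_office"]:
    node, r = rm.step(node, prop)
    total += r
print(node, total)   # u_acc 1.0
```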

<span title='2024-04-12 15:12:54 +1000 AEST'>April 12, 2024</span>&nbsp;·&nbsp;5 min&nbsp;·&nbsp;965 words&nbsp;·&nbsp;Sukai Huang

Shanchuan Efficient N Robust Exploration Through Discriminative Ir 2023

[TOC] Title: DEIR: Efficient and Robust Exploration through Discriminative-Model-Based Episodic Intrinsic Rewards Author: Shanchuan Wan et al. Publish Year: 18 May 2023 Review Date: Fri, Apr 12, 2024 url: https://arxiv.org/abs/2304.10770 Summary of paper Motivation Recent studies have shown the effectiveness of encouraging exploration with intrinsic rewards estimated from novelties in observations However, there is a gap between the novelty of an observation and actual exploration, as both the stochasticity in the environment and the agent’s behaviour may affect the observation....

<span title='2024-04-12 15:07:58 +1000 AEST'>April 12, 2024</span>&nbsp;·&nbsp;9 min&nbsp;·&nbsp;1795 words&nbsp;·&nbsp;Sukai Huang

Discover Hierarchical Achieve in Rl via Cl 2023

[TOC] Title: Discovering Hierarchical Achievements in Reinforcement Learning via Contrastive Learning Author: Seungyong Moon et al. Publish Year: 2 Nov 2023 Review Date: Tue, Apr 2, 2024 url: https://arxiv.org/abs/2307.03486 Summary of paper Contribution PPO agents demonstrate some ability to predict future achievements. Leveraging this observation, a novel contrastive learning method called achievement distillation is introduced, enhancing the agent’s predictive abilities. This approach excels at discovering hierarchical achievements. Some key terms Model-based and explicit-module approaches in previous studies are not that good...

<span title='2024-04-02 21:02:37 +1100 AEDT'>April 2, 2024</span>&nbsp;·&nbsp;5 min&nbsp;·&nbsp;1047 words&nbsp;·&nbsp;Sukai Huang

Jia Li Structured Cot Prompting for Code Generation 2023

[TOC] Title: Structured Chain of Thought Prompting for Code Generation 2023 Author: Jia Li et al. Publish Year: 7 Sep 2023 Review Date: Wed, Feb 28, 2024 url: https://arxiv.org/pdf/2305.06599.pdf Summary of paper Contribution The paper introduces Structured CoTs (SCoTs) and a novel prompting technique called SCoT prompting for improving code generation with Large Language Models (LLMs) like ChatGPT and Codex. Unlike the previous Chain-of-Thought (CoT) prompting, which focuses on natural language reasoning steps, SCoT prompting leverages the structural information inherent in source code....
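
An illustrative SCoT-style prompt (my own simplification, not the paper's exact template): the model is asked to draft the solution with program structures (sequence, branch, loop) before writing the code itself:

```python
scot_prompt = """\
### Task
Write a Python function that returns the second-largest element of a list.

### Structured chain of thought (use sequence / branch / loop structures)
Input: a list nums.
Branch: if len(nums) < 2, return None.
Sequence: track two variables, largest and second.
Loop: for each x in nums, update largest and second.
Output: return second.

### Code
"""
```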

<span title='2024-02-28 19:59:38 +1100 AEDT'>February 28, 2024</span>&nbsp;·&nbsp;2 min&nbsp;·&nbsp;381 words&nbsp;·&nbsp;Sukai Huang

Stephanie Teaching Models to Express Their Uncertainty in Words 2022

[TOC] Title: Teaching Models to Express Their Uncertainty in Words Author: Stephanie Lin et al. Publish Year: 13 Jun 2022 Review Date: Wed, Feb 28, 2024 url: https://arxiv.org/pdf/2205.14334.pdf Summary of paper Motivation The study demonstrates that a GPT-3 model can articulate uncertainty about its answers in natural language without relying on model logits. It generates both an answer and a confidence level (e.g., “90% confidence” or “high confidence”), which map to well-calibrated probabilities....

<span title='2024-02-28 16:12:53 +1100 AEDT'>February 28, 2024</span>&nbsp;·&nbsp;2 min&nbsp;·&nbsp;327 words&nbsp;·&nbsp;Sukai Huang

Gwenyth Estimating Confidence of Llm by Prompt Agreement 2023

[TOC] Title: Strength in Numbers: Estimating Confidence of Large Language Models by Prompt Agreement Author: Gwenyth Portillo Wightman et al. Publish Year: TrustNLP 2023 Review Date: Tue, Feb 27, 2024 url: https://aclanthology.org/2023.trustnlp-1.28.pdf Summary of paper Motivation while traditional classifiers produce scores for each label, language models instead produce scores for the generation, which may not be well calibrated. the authors proposed a method that involves comparing generated outputs across diverse prompts to create a confidence score....
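
A small sketch of the idea: query the model through several paraphrased prompts and use the agreement rate among the answers as the confidence score. `generate` is a placeholder for any prompt-to-completion callable, and the paraphrases and normalization are illustrative:

```python
from collections import Counter

def agreement_confidence(question, paraphrases, generate):
    answers = [generate(p.format(question=question)).strip().lower() for p in paraphrases]
    best, votes = Counter(answers).most_common(1)[0]
    return best, votes / len(answers)   # majority answer, agreement rate as confidence

paraphrases = [
    "Answer concisely: {question}",
    "Question: {question}\nAnswer:",
    "{question} Respond with just the answer.",
]
# answer, confidence = agreement_confidence("What is the boiling point of water in Celsius?", paraphrases, generate)
```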

<span title='2024-02-27 15:44:06 +1100 AEDT'>February 27, 2024</span>&nbsp;·&nbsp;2 min&nbsp;·&nbsp;393 words&nbsp;·&nbsp;Sukai Huang

Sudhir Agarwal Translate Infer Compile for Accurate Text to Plan 2024

[TOC] Title: TIC: Translate-Infer-Compile for accurate “text to plan” using LLMs and logical intermediate representations Author: Sudhir Agarwal et al. Publish Year: Jan 2024 Review Date: Sat, Feb 17, 2024 url: https://arxiv.org/pdf/2402.06608.pdf Summary of paper Motivation using an LLM to generate the task PDDL from natural language planning task descriptions is challenging. One of the primary reasons for failure is that the LLM often makes errors when generating information that must abide by the constraints specified in the domain knowledge or the task descriptions...

<span title='2024-02-17 12:56:25 +1100 AEDT'>February 17, 2024</span>&nbsp;·&nbsp;3 min&nbsp;·&nbsp;639 words&nbsp;·&nbsp;Sukai Huang

Philip Cohen Intention Is Choice With Commitment 1990

[TOC] Title: Intention Is Choice With Commitment Author: Philip Cohen et al. Publish Year: 1990 Review Date: Tue, Jan 30, 2024 url: https://www.sciencedirect.com/science/article/pii/0004370290900555 Summary of paper Contribution This paper delves into the principles governing the rational balance between an agent’s beliefs, goals, actions, and intentions, offering valuable insights for both artificial agents and a theory of human action. It focuses on clarifying when an agent can abandon their goals and how strongly they are committed to these goals....

<span title='2024-01-30 23:17:51 +1100 AEDT'>January 30, 2024</span>&nbsp;·&nbsp;4 min&nbsp;·&nbsp;752 words&nbsp;·&nbsp;Sukai Huang

Christian Muise Planning for Goal Oriented Dialogue Systems 2019

[TOC] Title: Christian Muise Planning for Goal Oriented Dialogue Systems 2019 Author: Publish Year: Review Date: Tue, Jan 30, 2024 url: arXiv:1910.08137v1 Summary of paper Motivation there is increasing demand for dialogue agents capable of handling specific tasks and interactions in a business context Contribution the authors propose a new approach that eliminates the need for manual specification of dialogue trees, a common practice in existing systems. they suggest using a declarative representation of the dialogue agent, which can be processed by advanced planning techniques (tree -> planning) The paper introduces a paradigm shift in specifying complex dialogue agents by recognizing that many aspects of these agents share similarities or identical underlying processes....

<span title='2024-01-30 16:58:06 +1100 AEDT'>January 30, 2024</span>&nbsp;·&nbsp;2 min&nbsp;·&nbsp;416 words&nbsp;·&nbsp;Sukai Huang

Vishal Pallagani Llm N Planning Survey 2024

[TOC] Title: “On the Prospects of Incorporating Large Language Models (LLMs) in Automated Planning and Scheduling (APS).” Author: Pallagani, Vishal, et al. Publish Year: arXiv preprint arXiv:2401.02500 (2024). Review Date: Mon, Jan 29, 2024 url: Summary of paper Contribution The paper provides a comprehensive review of 126 papers focusing on the integration of Large Language Models (LLMs) within Automated Planning and Scheduling, a growing area in Artificial Intelligence (AI). It identifies eight categories where LLMs are applied in addressing various aspects of planning problems:...

<span title='2024-01-29 23:02:47 +1100 AEDT'>January 29, 2024</span>&nbsp;·&nbsp;3 min&nbsp;·&nbsp;546 words&nbsp;·&nbsp;Sukai Huang

Ishika Singh Progprompt Program Generation for Robot Task Planning 2023

[TOC] Title: ProgPrompt: program generation for situated robot task planning using large language models Author: Ishika Singh et al. Publish Year: 28 August 2023 Review Date: Mon, Jan 29, 2024 url: https://progprompt.github.io/ Summary of paper Motivation Classical task planning: requires myriad domain knowledge; large search space, hard to scale; domain specific; requires concrete goal specification Planning with LLMs: LLM is not situated in the scene; plan steps may use unavailable actions and objects; text-to-robot action mapping may not be trivial; combinatorial admissible action space....
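
For flavor, an illustrative ProgPrompt-style prompt (simplified, not the paper's exact format): the prompt itself is Python-like code listing importable actions, available objects, and an example plan, so the LLM completes the next task as a program. The action and object names below are made up:

```python
progprompt = '''
from actions import walk, grab, putin, open, close
objects = ["apple", "fridge", "table", "plate"]

def put_apple_in_fridge():
    # 1: go to the table and pick up the apple
    walk("table")
    grab("apple")
    assert "apple" in holding
    # 2: open the fridge, put the apple inside, close it
    walk("fridge")
    open("fridge")
    putin("apple", "fridge")
    close("fridge")

def throw_away_plate():
'''
```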

<span title='2024-01-29 20:45:59 +1100 AEDT'>January 29, 2024</span>&nbsp;·&nbsp;1 min&nbsp;·&nbsp;101 words&nbsp;·&nbsp;Sukai Huang