Silviu Pitis Failure Modes of Learning Reward Models for Sequence Model 2023

[TOC] Title: Failure Modes of Learning Reward Models for LLMs and other Sequence Models Author: Silviu Pitis Publish Year: ICML workshop 2023 Review Date: Fri, May 10, 2024 url: https://openreview.net/forum?id=NjOoxFRZA4&noteId=niZsZfTPPt Summary of paper C3. Preferences cannot be represented as numbers. M1. Rationality level of human preference (3.2): if the condition/context changes, the preference may change rapidly, and this cannot be reflected in the reward model. A2. Preferences should be expressed with respect to state-policy pairs, rather than just outcomes; a state-policy pair includes both the current state of the system and the strategy (policy) being employed....
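A minimal illustration of claim C3 (my own example, not taken from the paper): any cyclic or context-dependent preference admits no scalar representation, since a utility $u$ with $x \succ y \iff u(x) > u(y)$ would require

$$
A \succ B,\quad B \succ C,\quad C \succ A \;\;\Longrightarrow\;\; u(A) > u(B) > u(C) > u(A),
$$

which is a contradiction.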

May 10, 2024 · 2 min · 312 words · Sukai Huang

Gaurav Ghosal the Effect of Modeling Human Rationality Level 2023

[TOC] Title: The Effect of Modeling Human Rationality Level on Learning Rewards from Multiple Feedback Types Author: Gaurav R. Ghosal et al. Publish Year: 9 Mar 2023, AAAI 2023 Review Date: Fri, May 10, 2024 url: arXiv:2208.10687v2 Summary of paper Contribution: We find that overestimating human rationality can have dire effects on reward learning accuracy and regret. We also find that fitting the rationality coefficient to human data enables better reward learning, even when the human deviates significantly from the noisy-rational choice model due to systematic biases. Some key terms: What is the Boltzmann rationality coefficient $\beta$...
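For context, a minimal sketch (my own, not from the paper) of the standard Boltzmann-rational choice model that the coefficient $\beta$ parameterizes: the human chooses option $a$ with probability proportional to $\exp(\beta\, r(a))$, so $\beta \to 0$ gives uniform random choice and large $\beta$ approaches perfectly rational choice.

```python
import numpy as np

def boltzmann_choice_probs(rewards: np.ndarray, beta: float) -> np.ndarray:
    """P(a) ∝ exp(beta * r(a)): the noisy-rational (Boltzmann) choice model."""
    logits = beta * rewards
    logits -= logits.max()          # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

rewards = np.array([1.0, 0.5, 0.0])
print(boltzmann_choice_probs(rewards, beta=0.0))   # ~uniform: fully noisy human
print(boltzmann_choice_probs(rewards, beta=10.0))  # ~argmax: near-rational human
```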

May 10, 2024 · 2 min · 312 words · Sukai Huang

Nate Rahn Policy Optimization in Noisy Neighbourhood 2023

[TOC] Title: Policy Optimization in a Noisy Neighborhood Author: Nate Rahn et al. Publish Year: NeurIPS 2023 Review Date: Fri, May 10, 2024 url: https://arxiv.org/abs/2309.14597 Summary of paper Contribution: in this paper, we demonstrate that high-frequency discontinuities in the mapping from policy parameters $\theta$ to return $R(\theta)$ are an important cause of return variation. As a consequence of these discontinuities, a single gradient step or perturbation to the policy parameters often causes significant changes in the return, even in settings where both the policy and the dynamics are deterministic....
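A minimal sketch (my own, with a hypothetical `evaluate_return` and a toy return landscape) of the kind of probe this motivates: perturb the policy parameters $\theta$ slightly and measure how much the deterministic return $R(\theta)$ varies in the neighbourhood.

```python
import numpy as np

def perturbation_return_spread(theta, evaluate_return, sigma=1e-3, n_samples=20, seed=0):
    """Estimate return variation in a small neighbourhood of theta.

    evaluate_return(theta) -> float is assumed to run a deterministic rollout and
    return R(theta); a large spread suggests a noisy, discontinuous return landscape.
    """
    rng = np.random.default_rng(seed)
    returns = [evaluate_return(theta + sigma * rng.standard_normal(theta.shape))
               for _ in range(n_samples)]
    return np.std(returns), np.ptp(returns)   # std and max-min of nearby returns

# toy usage with a synthetic, step-like "return landscape"
theta0 = np.zeros(4)
toy_return = lambda th: float(np.floor(10 * th.sum()))
print(perturbation_return_spread(theta0, toy_return, sigma=0.05))
```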

May 10, 2024 · 3 min · 510 words · Sukai Huang

Ademi Adeniji Language Reward Modulation for Pretraining Rl 2023

[TOC] Title: Language Reward Modulation for Pretraining Reinforcement Learning Author: Ademi Adeniji et al. Publish Year: ICLR 2023 (rejected) Review Date: Thu, May 9, 2024 url: https://openreview.net/forum?id=SWRFC2EupO Summary of paper Motivation: Learned reward functions (LRFs) are notorious for noise and reward misspecification errors, which can render them highly unreliable for learning robust policies with RL; due to reward exploitation and model noise, these LRFs are ill-suited for directly learning downstream tasks....

May 9, 2024 · 2 min · 338 words · Sukai Huang

Thomas Coste Reward Model Ensembles Help Mitigate Overoptimization 2024

[TOC] Title: Reward Model Ensembles Help Mitigate Overoptimization Author: Thomas Coste et al. Publish Year: 10 Mar 2024 Review Date: Thu, May 9, 2024 url: arXiv:2310.02743v2 Summary of paper Motivation: however, as imperfect representations of the “true” reward, these learned reward models are susceptible to over-optimization. Contribution: the authors conducted a systematic study to evaluate the efficacy of ensemble-based conservative optimization objectives, specifically worst-case optimization (WCO) and uncertainty-weighted optimization (UWO), for mitigating reward model overoptimization. The authors additionally extend the setup to include 25% label noise to better mirror real-world conditions. For PPO, ensemble-based conservative optimization always reduces overoptimization and outperforms single reward model optimization. Some key terms: Overoptimization...
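A rough sketch (my own reading of the objectives as the excerpt describes them, not the paper's exact formulation): worst-case optimization (WCO) keeps only the most pessimistic ensemble member's reward, while uncertainty-weighted optimization (UWO) penalizes the ensemble mean by the disagreement among members, weighted by a coefficient (here called `lam`, an assumed name).

```python
import numpy as np

def wco_reward(ensemble_rewards: np.ndarray) -> float:
    """Worst-case optimization: trust only the most pessimistic ensemble member."""
    return float(np.min(ensemble_rewards))

def uwo_reward(ensemble_rewards: np.ndarray, lam: float = 1.0) -> float:
    """Uncertainty-weighted optimization: mean reward minus a disagreement penalty."""
    return float(np.mean(ensemble_rewards) - lam * np.var(ensemble_rewards))

# Example: three reward models scoring the same response, with some disagreement
scores = np.array([1.2, 0.9, -0.4])
print(wco_reward(scores), uwo_reward(scores, lam=0.5))
```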

May 9, 2024 · 1 min · 205 words · Sukai Huang

Mengdi Li Internally Rewarded Rl 2023

[TOC] Title: Internally Rewarded Reinforcement Learning Author: Mengdi Li et al. Publish Year: 2023 PMLR Review Date: Wed, May 8, 2024 url: https://proceedings.mlr.press/v202/li23ax.html Summary of paper Motivation: the authors study a class of RL problems where the reward signals for policy learning are generated by a discriminator that is dependent on and jointly optimized with the policy (parallel training of both the policy and the reward model). This leads to an unstable learning process because reward signals from an immature discriminator are noisy and impede policy learning, and conversely, an under-optimized policy impedes discriminator learning. We call this learning setting Internally Rewarded RL (IRRL), as the reward is not provided directly by the environment but internally by the discriminator....
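A schematic sketch of the IRRL setting (my own, with hypothetical `Policy`/`Discriminator` interfaces, not the paper's algorithm): the reward used to update the policy comes from a discriminator that is itself trained on the policy's rollouts, so the two co-evolve and noise in either one feeds back into the other.

```python
from typing import Protocol, Sequence

class Policy(Protocol):
    def rollout(self, env) -> Sequence: ...                  # collect a trajectory of observations
    def update(self, trajectory: Sequence, rewards: Sequence[float]) -> None: ...

class Discriminator(Protocol):
    def score(self, observation) -> float: ...                # internal reward signal
    def update(self, trajectory: Sequence) -> None: ...

def irrl_step(policy: Policy, discriminator: Discriminator, env) -> None:
    """One iteration of the internally rewarded loop: policy and discriminator are
    optimized in parallel, so an immature discriminator yields noisy rewards for the
    policy, and an under-optimized policy yields poor training data for the discriminator."""
    trajectory = policy.rollout(env)
    rewards = [discriminator.score(obs) for obs in trajectory]
    policy.update(trajectory, rewards)
    discriminator.update(trajectory)
```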

May 8, 2024 · 4 min · 682 words · Sukai Huang

Daniel Hierarchies of Reward Machines 2023

[TOC] Title: Hierarchies of Reward Machines Author: Daniel Furelos-Blanco et al. Publish Year: 4 Jun 2023 Review Date: Fri, Apr 12, 2024 url: https://arxiv.org/abs/2205.15752 Summary of paper Motivation: Finite state machines are a simple yet powerful formalism for abstractly representing temporal tasks in a structured manner. Contribution: The work introduces Hierarchies of Reward Machines (HRMs) to enhance the abstraction power of existing models. Key contributions include: HRM Abstraction Power: HRMs allow for the creation of hierarchies of reward machines (RMs), enabling constituent RMs to call other RMs....
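A minimal reward-machine sketch (my own illustration, not the paper's formalism): a finite state machine whose states transition on high-level events and emit rewards; in an HRM, a transition could additionally call a child machine before advancing. The "coffee, then office" task below is a generic example from the reward-machine literature, not from this paper.

```python
class RewardMachine:
    """Toy reward machine: transitions = {(state, event): (next_state, reward)}."""
    def __init__(self, transitions, initial_state="u0"):
        self.transitions = transitions
        self.state = initial_state

    def step(self, event: str) -> float:
        # unknown events leave the state unchanged and give zero reward
        next_state, reward = self.transitions.get((self.state, event), (self.state, 0.0))
        self.state = next_state
        return reward

# Task "get coffee, then deliver it to the office": reward only on the second subgoal.
rm = RewardMachine({
    ("u0", "coffee"): ("u1", 0.0),
    ("u1", "office"): ("u_acc", 1.0),
})
print(rm.step("coffee"), rm.step("office"))  # 0.0 1.0
```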

April 12, 2024 · 5 min · 965 words · Sukai Huang

Shanchuan Efficient N Robust Exploration Through Discriminative Ir 2023

[TOC] Title: DEIR: Efficient and Robust Exploration through Discriminative-Model-Based Episodic Intrinsic Rewards Author: Shanchuan Wan et al. Publish Year: 18 May 2023 Review Date: Fri, Apr 12, 2024 url: https://arxiv.org/abs/2304.10770 Summary of paper Motivation: Recent studies have shown the effectiveness of encouraging exploration with intrinsic rewards estimated from novelty in observations. However, there is a gap between the novelty of an observation and the exploration it actually reflects, as both the stochasticity in the environment and the agent's behaviour may affect the observation....
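For reference, a sketch of the plain episodic novelty bonus that this line of work builds on (my own illustration of the baseline idea, not DEIR's discriminative reward): reward any observation not yet seen in the current episode. Purely stochastic observation noise also looks "novel" under this scheme, which is exactly the novelty-vs-exploration gap the paper targets.

```python
class EpisodicNoveltyBonus:
    """Naive episodic novelty: +1 the first time an observation appears in an episode."""
    def __init__(self):
        self.seen = set()

    def reset(self):
        self.seen.clear()

    def reward(self, observation) -> float:
        key = hash(observation)     # assumes hashable observations (e.g. tuples/bytes)
        if key in self.seen:
            return 0.0
        self.seen.add(key)
        return 1.0

bonus = EpisodicNoveltyBonus()
print(bonus.reward((0, 1)), bonus.reward((0, 1)), bonus.reward((2, 3)))  # 1.0 0.0 1.0
```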

April 12, 2024 · 9 min · 1795 words · Sukai Huang