Silviu Pitis Failure Modes of Learning Reward Models for Sequence Model 2023

[TOC] Title: Failure Modes of Learning Reward Models for LLMs and other Sequence Models Author: Silviu Pitis Publish Year: ICML workshop 2023 Review Date: Fri, May 10, 2024 url: https://openreview.net/forum?id=NjOoxFRZA4&noteId=niZsZfTPPt Summary of paper C3. Preferences cannot be represented as single numbers M1. the rationality level of human preferences 3.2: if the condition/context changes, preferences may change rapidly, and this cannot be reflected in the reward model A2. Preferences should be expressed with respect to state-policy pairs, rather than just outcomes A state-policy pair includes both the current state of the system and the strategy (policy) being employed. This avoids the complication of unresolved stochasticity (randomness that has not yet been resolved), instead of judging policies only by their realized outcomes. Example with Texas Hold’em: The author uses an example from poker to illustrate these concepts. In the example, a player holding a weaker hand (72o) wins against a stronger hand (AA) after both commit to large bets pre-flop. Traditional reward modeling would prefer the successful trajectory of the weaker hand because of its positive outcome. However, a rational analysis (ignoring stochastic outcomes) would prefer the decision-making associated with the stronger hand (AA), even though it lost, as that is typically the better strategy. ...
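
To make the outcome-vs-decision distinction concrete, here is a minimal Python sketch (not from the paper; the helper names, hand equities, and return values are illustrative assumptions) contrasting a preference label derived from realized outcomes with one derived from the value of the state-policy pair:

```python
# Hypothetical illustration of the 72o-vs-AA example: outcome-based preference
# rewards the lucky trajectory, while preference over state-policy pairs rewards
# the better decision. Equities/returns below are rough, illustrative numbers.

def prefer_by_outcome(traj_a, traj_b):
    """Prefer whichever trajectory realized the higher return."""
    return "A" if traj_a["realized_return"] > traj_b["realized_return"] else "B"

def prefer_by_state_policy(traj_a, traj_b):
    """Prefer whichever (state, policy) pair has the higher expected value,
    ignoring how the unresolved stochasticity (the board run-out) played out."""
    return "A" if traj_a["expected_value"] > traj_b["expected_value"] else "B"

# 72o gets all-in pre-flop against AA and happens to win the pot.
traj_72o = {"hand": "72o", "expected_value": -0.76, "realized_return": +1.0}
traj_aa  = {"hand": "AA",  "expected_value": +0.76, "realized_return": -1.0}

print(prefer_by_outcome(traj_72o, traj_aa))       # "A": prefers the lucky outcome
print(prefer_by_state_policy(traj_72o, traj_aa))  # "B": prefers the better decision
```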

May 10, 2024 · 2 min · 312 words · Sukai Huang

Gaurav Ghosal the Effect of Modeling Human Rationality Level 2023

[TOC] Title: The Effect of Modeling Human Rationality Level on Learning Rewards from Multiple Feedback Types Author: Gaurav R. Ghosal et al. Publish Year: 9 Mar 2023, AAAI 2023 Review Date: Fri, May 10, 2024 url: arXiv:2208.10687v2 Summary of paper Contribution We find that overestimating human rationality can have dire effects on reward learning accuracy and regret We also find that fitting the rationality coefficient to human data enables better reward learning, even when the human deviates significantly from the noisy-rational choice model due to systematic biases Some key terms What is the Boltzmann rationality coefficient $\beta$ ...
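
For reference, the noisy-rational (Boltzmann) choice model behind $\beta$ can be sketched in a few lines of Python; this is a generic illustration rather than the paper's code, and the reward values below are made up:

```python
# Boltzmann (noisy-rational) choice model: P(choose a_i) ∝ exp(beta * r(a_i)).
# beta = 0 yields uniformly random choices; large beta approaches a perfectly
# rational human who always picks the highest-reward option.
import numpy as np

def boltzmann_choice_probs(rewards, beta):
    logits = beta * np.asarray(rewards, dtype=float)
    logits -= logits.max()          # subtract max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

rewards = [1.0, 2.0, 3.0]           # illustrative option rewards
for beta in (0.0, 1.0, 10.0):
    print(beta, boltzmann_choice_probs(rewards, beta).round(3))
# 0.0  -> [0.333 0.333 0.333]  (pure noise)
# 10.0 -> [0.    0.    1.   ]  (near-deterministic, fully "rational")
```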

May 10, 2024 · 2 min · 312 words · Sukai Huang

Nate Rahn Policy Optimization in Noisy Neighbourhood 2023

[TOC] Title: Policy Optimization in Noisy Neighborhood Author: Nate Rahn et al. Publish Year: NeurIPS 2023 Review Date: Fri, May 10, 2024 url: https://arxiv.org/abs/2309.14597 Summary of paper Contribution in this paper, we demonstrate that high-frequency discontinuities in the mapping from policy parameters $\theta$ to return $R(\theta)$ are an important cause of return variation. As a consequence of these discontinuities, a single gradient step or perturbation to the policy parameters often causes substantial changes in the return, even in settings where both the policy and the dynamics are deterministic. unstable learning in some sense based on this observation, we demonstrate the usefulness of studying the landscape through the distribution of returns obtained from small perturbations of $\theta$ Some key terms Evidence that noisy reward signal leads to substantial variance in performance ...
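
The perturbation-based diagnostic can be sketched as follows; this is a generic illustration (assuming access to some deterministic `policy_return(theta)` evaluator), not the paper's implementation:

```python
# Sample small Gaussian perturbations around theta and inspect the distribution of
# returns in that neighbourhood: a wide or multi-modal distribution is evidence of
# the high-frequency discontinuities in the theta -> R(theta) landscape noted above.
import numpy as np

def neighbourhood_returns(policy_return, theta, sigma=0.01, n_samples=100, seed=0):
    rng = np.random.default_rng(seed)
    returns = []
    for _ in range(n_samples):
        perturbed = theta + sigma * rng.standard_normal(theta.shape)
        returns.append(policy_return(perturbed))   # e.g. one deterministic rollout
    return np.array(returns)

# Toy usage with a deliberately jagged stand-in return function.
toy_return = lambda th: float(np.sin(50.0 * th.sum()))
returns = neighbourhood_returns(toy_return, theta=np.zeros(4))
print(returns.mean(), returns.std())
```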

May 10, 2024 · 3 min · 510 words · Sukai Huang

Ademi Adeniji Language Reward Modulation for Pretraining Rl 2023

[TOC] Title: Language Reward Modulation for Pretraining Reinforcement Learning Author: Ademi Adeniji et al. Publish Year: ICLR 2023 reject Review Date: Thu, May 9, 2024 url: https://openreview.net/forum?id=SWRFC2EupO Summary of paper Motivation Learned reward functions (LRFs) are notorious for noise and reward misspecification errors, which can render them highly unreliable for learning robust policies with RL; because of reward exploitation and model noise, these LRFs are ill-suited for directly learning downstream tasks. Generalization issues of multi-modal vision-and-language models (VLMs) ...

May 9, 2024 · 2 min · 338 words · Sukai Huang

Thomas Coste Reward Model Ensembles Help Mitigate Overoptimization 2024

[TOC] Title: Reward Model Ensembles Help Mitigate Overoptimization Author: Thomas Coste et al. Publish Year: 10 Mar 2024 Review Date: Thu, May 9, 2024 url: arXiv:2310.02743v2 Summary of paper Motivation however, as imperfect representations of the “true” reward, these learned reward models are susceptible to over-optimization. Contribution the author conducted a systematic study to evaluate the efficacy of using ensemble-based conservative optimization objectives, specifically worst-case optimization (WCO) and uncertainty-weighted optimization (UWO), for mitigating reward model overoptimization the author additionally extends the setup to include 25% label noise to better mirror real-world conditions For PPO, ensemble-based conservative optimization always reduces overoptimization and outperforms single reward model optimization Some key terms Overoptimization ...
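
The two conservative objectives can be sketched as follows; this is an illustrative reading of WCO and UWO (the penalty coefficient `lam` and the ensemble values are made up, and the exact form of the uncertainty penalty may differ from the paper):

```python
# Ensemble-based conservative reward aggregation over K reward models:
#   WCO: take the minimum (worst-case) reward across ensemble members.
#   UWO: take the mean reward minus a penalty on ensemble disagreement.
import numpy as np

def wco_reward(member_rewards):
    """Worst-case optimization: trust only the most pessimistic ensemble member."""
    return np.min(member_rewards, axis=0)

def uwo_reward(member_rewards, lam=1.0):
    """Uncertainty-weighted optimization: penalize the mean by ensemble variance."""
    return np.mean(member_rewards, axis=0) - lam * np.var(member_rewards, axis=0)

# Rewards from a 4-model ensemble for 3 candidate responses (illustrative numbers).
ensemble = np.array([[1.0, 0.2, 0.9],
                     [0.8, 0.1, 1.1],
                     [1.2, 0.3, 0.2],
                     [0.9, 0.2, 1.0]])
print(wco_reward(ensemble))            # [0.8 0.1 0.2]
print(uwo_reward(ensemble, lam=1.0))   # mean reward shrunk where members disagree
```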

May 9, 2024 · 1 min · 205 words · Sukai Huang

Mengdi Li Internally Rewarded Rl 2023

[TOC] Title: Internally Rewarded Reinforcement Learning Author: Mengdi Li et al. Publish Year: 2023 PMLR Review Date: Wed, May 8, 2024 url: https://proceedings.mlr.press/v202/li23ax.html Summary of paper Motivation the author studied a class of RL problems where the reward signals for policy learning are generated by a discriminator that is dependent on and jointly optimized with the policy (parallel training of both the policy and the reward model) this leads to an unstable learning process because reward signals from an immature discriminator are noisy and impede policy learning, and conversely, an under-optimized policy impedes discriminator learning we call this learning setting Internally Rewarded RL (IRRL), as the reward is not provided directly by the environment but internally by the discriminator. Contribution proposed the clipped linear reward function. Results show that the proposed reward function can consistently stabilize the training process by reducing the impact of reward noise, which leads to faster convergence and higher performance. we formulate a class of RL problems as IRRL, and formulate the inherent issues of noisy rewards that lead to an unstable training loop in IRRL we empirically characterize the noise in the discriminator and derive the effect of the reward function in reducing the bias of the estimated reward and the variance of reward noise from an underdeveloped discriminator Comment: the author expresses the bias and variance of the reward noise via a Taylor approximation and proposes a clipped linear reward function Some key terms Simultaneous optimization causes suboptimal training ...
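
As a rough illustration of why a clipped linear reward is less sensitive to an immature discriminator than the common log-probability reward, here is a hedged sketch (the exact clipping threshold and functional form used in the paper may differ; `tau` is a hypothetical baseline parameter):

```python
# Reward as a linear function of the discriminator's probability for the correct
# label, clipped below at a baseline tau: the signal stays bounded even when the
# discriminator is still poorly trained. The log-probability reward, by contrast,
# produces extreme negative values when that probability is small.
import numpy as np

def clipped_linear_reward(p_correct, tau=0.0):
    return np.maximum(p_correct - tau, 0.0)

def log_prob_reward(p_correct, eps=1e-8):
    return np.log(p_correct + eps)

p = np.array([0.01, 0.3, 0.9])      # discriminator confidence early vs. late training
print(clipped_linear_reward(p))     # bounded in [0, 1]: low-magnitude, low-variance
print(log_prob_reward(p))           # ~[-4.6, -1.2, -0.1]: blows up for small p
```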

May 8, 2024 · 4 min · 682 words · Sukai Huang