[TOC]
- Title: Language Reward Modulation for Pretraining Reinforcement Learning
- Author: Ademi Adeniji et al.
- Publish Year: ICLR 2023 reject
- Review Date: Thu, May 9, 2024
- url: https://openreview.net/forum?id=SWRFC2EupO
Summary of paper
Motivation
Learned reward functions (LRFs) are notorious for noise and reward misspecification errors
- which can render them highly unreliable for learning robust policies with RL
- because of reward exploitation and model noise, these LRFs are ill-suited for directly learning downstream tasks
Generalization issues of multi-modal vision-language models (VLMs)
- nah, the authors did not mention this
Contribution
- Using LAMP directly as a task reward would not enable the RL agents to solve any of the tasks. As we evidence in Figure 3, and motivate in the introduction, rewards from current pretrained VLMs are far too noisy and susceptible to reward exploitation to be usable with online RL. In fact, this is central to our message: current VLMs are ill-suited for this purpose in that they lack the required precision to supervise challenging robot manipulation tasks; however, they do possess certain properties that render them very useful for pretraining RL agents with limited supervision.
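To make this pretrain-then-finetune reward split concrete, here is a minimal Python sketch of the idea as I read it: a (stubbed-out) VLM image-text alignment score modulates the reward only during pretraining, while downstream fine-tuning switches to the environment's ground-truth task reward. The scorer, the noise model, and all function names below are hypothetical stand-ins for illustration, not the paper's actual LAMP implementation.

```python
import numpy as np

# Hypothetical stand-in for a pretrained VLM alignment scorer (e.g. a CLIP-style
# image-text similarity). The added Gaussian noise stands in for the
# misspecification that makes this signal unreliable as a direct task reward.
def vlm_alignment_score(observation: np.ndarray, instruction: str) -> float:
    base = float(np.clip(observation.mean(), 0.0, 1.0))
    return float(np.clip(base + np.random.normal(scale=0.2), 0.0, 1.0))

def pretraining_reward(observation: np.ndarray, instruction: str,
                       exploration_bonus: float) -> float:
    # During pretraining, the noisy VLM score only modulates an exploration-style
    # objective; it is never trusted as the task reward itself.
    return vlm_alignment_score(observation, instruction) * exploration_bonus

def finetuning_reward(task_reward: float) -> float:
    # Downstream RL falls back to the environment's ground-truth task reward.
    return task_reward

# Toy usage: language-modulated reward for pretraining, task reward for fine-tuning.
obs = np.random.rand(64, 64, 3)
r_pre = pretraining_reward(obs, "pick up the red block", exploration_bonus=1.0)
r_fin = finetuning_reward(task_reward=0.0)
print(f"pretraining reward: {r_pre:.3f}, fine-tuning reward: {r_fin:.3f}")
```

The design point the authors stress is exactly this split: the noisy VLM score is only ever a pretraining signal, never the downstream task reward.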
Results
Summary
Regrettably, the authors did not provide compelling evidence to support the claim that the VLM reward signal is ineffective for training RL agents directly. Reviewers requested further explanation of why VLM reward signals may fall short on complex RL tasks.
Critics often counter with remarks such as, “If this is indeed a problem, why then do many studies successfully employ visual-text alignment based reward models?” So, it is very challenging to convince people that certain widely accepted methods may be flawed or inadequate. It’s a tough journey.
Additionally, the pretraining evaluation section of the paper does not substantiate the authors' assertion that the VLM reward signal is noisy and therefore unsuitable for training RL tasks. From the presented pretraining performance, it is not apparent that the VLM pretraining results are suboptimal.
Potential future work
The approach is essentially equivalent to using language-based signals for pretraining and then abandoning them afterwards.