[TOC]

  1. Title: Eureka: Human-Level Reward Design via Coding Large Language Models, 2023
  2. Author: Yecheng Jason Ma et al.
  3. Publish Year: 19 Oct 2023
  4. Review Date: Fri, Oct 27, 2023
  5. url: https://arxiv.org/pdf/2310.12931.pdf

Summary of paper

image-20231027164539472

Motivation

  • harnessing LLMs to learn complex low-level manipulation tasks remains an open problem.
  • we bridge this fundamental gap by using LLMs to produce rewards that can be used to acquire complex skills via reinforcement learning.

Contribution

  • Eureka generates reward functions that outperform expert human-engineered rewards.
  • the generality of Eureka also enables a new gradient-free in-context learning approach to reinforcement learning from human feedback (RLHF)
  • image-20231030132136067

Some key terms

  • given detailed environment code and a natural language description of the task, the LLM can generate candidate reward functions by sampling.
  • As many real-world RL tasks admit sparse rewards that are difficult to learn from, reward shaping that provides incremental learning signals is necessary in practice (a toy example follows)
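
A toy illustration of the sparse-versus-shaped distinction (a minimal sketch, not taken from the paper; the goal, success threshold, and distance metric are assumptions):

import numpy as np

def sparse_reward(state, goal):
    # non-zero only once the task is solved: provides almost no learning signal along the way
    return 1.0 if np.linalg.norm(state - goal) < 0.05 else 0.0

def shaped_reward(state, goal):
    # dense, incremental signal: the reward increases as the agent gets closer to the goal
    return -float(np.linalg.norm(state - goal))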

Reward design problem

image-20231027231538244

image-20231027231736218

Curriculum learning

image-20231030115254039

  • the paper mentions that curriculum learning was used during training.
  • here are the key aspects of curriculum learning:
    • gradual complexity increase
      • Curriculum learning starts by training models on easier or simpler tasks before gradually increasing the complexity of the tasks
    • improved learning efficiency and generalisation
    • Structured learning path
      • the curriculum provides a structured learning path, allowing the model to build upon previously learned concepts
    • implementation
      • implementing curriculum learning may involve designing a curriculum, i.e., a sequence of tasks of increasing complexity (a minimal sketch follows this list)
    • relation to other concepts
      • curriculum learning shares similarities with concepts like transfer learning and multi-task learning, but with a focus on the structured, gradual increase in task complexity.
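
A minimal sketch of such a curriculum loop (the task parameters and the train_policy / evaluate helpers are hypothetical, not from the paper):

def train_with_curriculum(train_policy, evaluate, success_threshold=0.8):
    # tasks ordered from easiest to hardest, e.g. by increasing goal distance
    curriculum = [{"goal_distance": d} for d in (0.1, 0.5, 1.0, 2.0)]
    policy = None
    for task_cfg in curriculum:
        # train on the current difficulty until the policy is good enough;
        # the policy from the easier stage is reused as the starting point
        policy = train_policy(policy, task_cfg)
        while evaluate(policy, task_cfg) < success_threshold:
            policy = train_policy(policy, task_cfg)
    return policy

The structured path is the key design choice here: the policy trained on an easier stage initialises training on the next, harder one.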

Some insights for the practical implementation

How does the model ensure the (semantic) correctness of the reward function?

  1. EUREKA requires the environment specification to be provided to the LLM; the authors directly feed the raw environment source code as context.
    • reason: the environment source code typically reveals what the environment semantically entails and which variables can and should be used to compose a reward function for the specified task
  2. they use evolutionary search to address the execution-error and sub-optimality challenges.
    • a very big assumption behind the success of this method
    • image-20231030133130933
    • evolutionary search, i.e., refining the reward in the next iteration based on its performance in the current iteration
      • the mutation operator is simply specified as a text prompt that suggests a few general ways to modify an existing reward code based on a textual summary of policy training
      • textual summary of policy training: after policy training, the policy's performance is evaluated
        • image-20231030133732964
      • basically, it is a loop that keeps refining the reward function (see the sketch below)
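
A minimal sketch of that refinement loop (the callables llm_sample_rewards, train_policy, evaluate, and summarise_training are hypothetical stand-ins for the paper's components, not its actual API):

def eureka_loop(llm_sample_rewards, train_policy, evaluate, summarise_training,
                env_source_code, task_description, n_iterations=5, n_samples=16):
    best_reward_code, best_score = None, float("-inf")
    feedback = ""  # textual summary of the previous iteration's policy training
    for _ in range(n_iterations):
        # 1. sample a batch of candidate reward functions (as code) from the LLM
        candidates = llm_sample_rewards(env_source_code, task_description, feedback, n_samples)
        scored = []
        for reward_code in candidates:
            # 2. train a policy with each candidate; discard candidates that fail to execute
            try:
                policy = train_policy(reward_code)
            except Exception:
                continue
            scored.append((evaluate(policy), reward_code, policy))
        if not scored:
            continue
        # 3. keep the best candidate and summarise its training as feedback for the next prompt
        score, reward_code, policy = max(scored, key=lambda item: item[0])
        if score > best_score:
            best_score, best_reward_code = score, reward_code
        feedback = summarise_training(policy)
    return best_reward_code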

How does the model avoid the computer vision part?

Minor comments

image-20231027171151010

image-20231027171203317

Potential future work

  • having LLMs generate reward functions directly to train agents essentially replaces human experts with LLMs for designing reward functions for learning agents.

  • The pipeline of this work is as follows (a minimal sketch in code follows this list):

    • the human users give the task descriptions
    • -> the LLM converts the descriptions into a reward function
    • -> the reward function is used to train an agent that can accomplish the task
    • -> the human users can give further requirements in the next iteration
    • -> the LLM further adjusts the reward function to fit the users’ needs.
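
A minimal sketch of this human-in-the-loop variant (llm_generate_reward, train_policy, and get_human_feedback are hypothetical helpers, not the paper's API):

def human_in_the_loop_reward_design(llm_generate_reward, train_policy,
                                    get_human_feedback, task_description, n_rounds=3):
    feedback = ""
    reward_code, policy = None, None
    for _ in range(n_rounds):
        # the LLM turns the task description plus accumulated feedback into reward code
        reward_code = llm_generate_reward(task_description, feedback)
        # the generated reward function is used to train an agent on the task
        policy = train_policy(reward_code)
        # the user inspects the behaviour and states further requirements in natural language
        new_feedback = get_human_feedback(policy)
        if not new_feedback:  # empty feedback: the user is satisfied
            break
        feedback += "\n" + new_feedback
    return reward_code, policy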

Hybrid model

Nir suggested that we can have a hybrid model in which conditions and effects are expressed partially in predicate logic and partially specified through imperative programming languages

  • I think it really depends on the type of task

    • if the task is about planning or scheduling, e.g., Sudoku, then imperative programming is not a good fit
    • if the task is about low-level control with no explicit discrete procedure, then defining a reward function (this work) is suitable
    • if the task description explicitly contains the steps, then converting it into an imperative programming language is suitable.
  • so it really depends on what the task description is

    • e.g., a cooking task provided with step-by-step info image-20231030150628046 -> imperative programming
    • e.g., “your task is to stack block A on top of B” -> predicate logic as conditions and effects
  • so the hybrid model is somewhat like a big model containing multiple specialised models that handle various types of tasks.

Imperative programming version of the action definition

both reward-function generation for actions and direct imperative functions for actions can be used as auxiliary information to tune a PDDL action definition

Reward function example from GPT-4

def reward_function(state, action):
    reward = 0

    # Assuming `is_on_ladder` and `is_moving_down` are functions that 
    # determine whether the agent is on a ladder and moving down, respectively.
    if is_on_ladder(state) and is_moving_down(action):
        reward += 1  # Give a positive reward for moving down a ladder

    # Optionally, penalize the agent for not moving down a ladder while on it
    elif is_on_ladder(state) and not is_moving_down(action):
        reward -= 1

    # Optionally, penalize or reward other behaviors
    # ...

    return reward

Imperative function example

class Environment:
    def __init__(self):
        self.agent_position = (0, 0)    # the agent starts at the top-left corner; y grows downwards
        self.ladder_position = (5, 0)   # top cell of the ladder
        self.ladder_height = 5          # the ladder spans cells (5, 0) .. (5, 4)

    def get_agent_position(self):
        return self.agent_position

    def ladder_cells(self):
        x, y = self.ladder_position
        return {(x, y + i) for i in range(self.ladder_height)}

    def move_agent(self, direction):
        x, y = self.agent_position
        if direction == "down" and self.is_ladder_below():
            y += 1                      # descending is only possible on the ladder
        elif direction == "left":
            x -= 1
        elif direction == "right":
            x += 1
        elif direction == "up":
            y -= 1
        self.agent_position = (x, y)

    def is_ladder_below(self):
        x, y = self.agent_position
        return (x, y + 1) in self.ladder_cells()

    def is_on_ladder(self):
        return self.agent_position in self.ladder_cells()


def climb_down_ladder(env):
    # walk along the top row towards the ladder before climbing down
    while not env.is_on_ladder():
        agent_x, _ = env.get_agent_position()
        ladder_x, _ = env.ladder_position
        if agent_x < ladder_x:
            env.move_agent("right")
        elif agent_x > ladder_x:
            env.move_agent("left")

    # now on the ladder, climb down to its bottom cell
    for _ in range(env.ladder_height - 1):
        env.move_agent("down")


def check_climb_down_complete(env):
    # climbing is complete once the agent has reached the ladder's bottom cell
    ladder_x, ladder_top_y = env.ladder_position
    bottom_cell = (ladder_x, ladder_top_y + env.ladder_height - 1)
    return env.get_agent_position() == bottom_cell


# Usage:
env = Environment()
climb_down_ladder(env)
print(check_climb_down_complete(env))  # True
  • both the reward function and the imperative action function contain state checks (e.g., is_on_ladder and is_moving_down)
  • the imperative programming version of the “climb_down_ladder” action contains a while loop that controls the agent to move towards the ladder before climbing down. This differs from a PDDL action definition, where is_on_ladder would be a precondition of the action.