[TOC]

  1. Title: Eureka: Human-Level Reward Design via Coding Large Language Models (2023)
  2. Author: Yecheng Jason Ma et al.
  3. Publish Date: 19 Oct 2023
  4. Review Date: Fri, Oct 27, 2023
  5. url: https://arxiv.org/pdf/2310.12931.pdf

Summary of paper

EUREKA prompts a coding LLM (GPT-4) with the raw environment source code and a task description so that it writes reward functions as executable code; the candidate rewards are improved through evolutionary search and textual reward reflection, and the resulting rewards match or outperform expert human-designed rewards across a broad suite of simulated RL tasks (including dexterous pen spinning).

Motivation

Contribution

Some key terms

reward design problem

Roughly: given a world model (the environment), a space of candidate reward functions, a policy-learning algorithm, and a task fitness function, the goal is to output a reward function such that the policy trained on it achieves the highest fitness score.

Curriculum learning

In the paper, curriculum learning is used for the hardest task (dexterous pen spinning): a policy is first trained with an EUREKA reward to reorient the pen to target poses, and is then fine-tuned to perform continuous spinning.

Some insights for the practical implementation

How does the model ensure the (semantic) correctness of the reward function?

  1. EUREKA requires the environment specification to be provided to the LLM; the raw environment source code is fed directly as context.
    • Reason: the environment source code typically reveals what the environment semantically entails and which variables can and should be used to compose a reward function for the specified task
  2. They use evolutionary search to address the execution-error and sub-optimality challenges.
    • a very big assumption behind the success of this method
    • evolutionary search, i.e., refine the reward in the next iteration based on the performance in the current iteration.
      • simply specifying the mutation operator as a text prompt that suggests a few general ways to modify an existing reward code based on a textual summary of policy training
      • textual summary of policy training (reward reflection): after policy training, the scalar values of the reward components and the task fitness are summarised in text and fed back to the LLM
      • basically it is a loop that iteratively refines the reward function (see the sketch after this list)
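
A minimal sketch of this refinement loop, assuming hypothetical helpers `llm_generate_rewards`, `train_policy`, `evaluate_fitness`, and `summarize_training` (the paper uses GPT-4 for generation and GPU-accelerated RL for training; the names and signatures below are made up for illustration):

```python
def eureka_search(env_source, task_description, iterations=5, samples_per_iter=16):
    """Sketch of an EUREKA-style evolutionary reward search loop (illustrative only)."""
    best_reward_code, best_fitness, feedback = None, float("-inf"), ""

    for _ in range(iterations):
        # The LLM sees the raw environment source, the task description, and the
        # textual feedback (reward reflection) from the previous iteration.
        candidates = llm_generate_rewards(env_source, task_description, feedback,
                                          n=samples_per_iter)

        for reward_code in candidates:
            try:
                policy, training_log = train_policy(reward_code)  # RL training with this reward
            except Exception:
                continue  # non-executable reward code is simply discarded

            fitness = evaluate_fitness(policy)  # ground-truth task fitness
            if fitness > best_fitness:
                best_fitness, best_reward_code = fitness, reward_code
                # Reward reflection: summarise per-component reward values and
                # task fitness during training into text for the next prompt.
                feedback = summarize_training(training_log, fitness)

    return best_reward_code
```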

How does the model avoid the computer vision part?

Minor comments


Potential future work

Hybrid model

Nir suggested that we could have a hybrid model in which conditions and effects are expressed partially in predicate logic and partially specified through imperative programming languages.
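
A minimal sketch of what such a hybrid action definition could look like, assuming a hypothetical `HybridAction` container whose precondition is a predicate-logic-style check and whose effect is an imperative function (names are illustrative, not from the paper):

```python
from dataclasses import dataclass
from typing import Callable, Dict

State = Dict[str, object]

@dataclass
class HybridAction:
    name: str
    precondition: Callable[[State], bool]  # declarative / predicate-logic side
    effect: Callable[[State], State]       # imperative side

    def apply(self, state: State) -> State:
        if not self.precondition(state):
            raise ValueError(f"precondition of {self.name} not satisfied")
        return self.effect(state)

# Example: "climb-down-ladder" with a logical precondition and an imperative effect
def on_ladder(state: State) -> bool:
    return state["agent_pos"] == state["ladder_top"]

def climb_down(state: State) -> State:
    new_state = dict(state)
    x, y = state["agent_pos"]
    new_state["agent_pos"] = (x, y + state["ladder_height"])  # move to the ladder bottom
    return new_state

climb_down_ladder = HybridAction("climb-down-ladder", on_ladder, climb_down)

# Usage
state = {"agent_pos": (5, 5), "ladder_top": (5, 5), "ladder_height": 5}
print(climb_down_ladder.apply(state)["agent_pos"])  # (5, 10)
```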

Imperative programming version of the action definition

Both reward-function generation for actions and direct imperative functions for actions can be used as auxiliary information to tune the PDDL action definitions.

Reward function example from GPT-4

def reward_function(state, action):
    reward = 0

    # Assuming `is_on_ladder` and `is_moving_down` are functions that 
    # determine whether the agent is on a ladder and moving down, respectively.
    if is_on_ladder(state) and is_moving_down(action):
        reward += 1  # Give a positive reward for moving down a ladder

    # Optionally, penalize the agent for not moving down a ladder while on it
    elif is_on_ladder(state) and not is_moving_down(action):
        reward -= 1

    # Optionally, penalize or reward other behaviors
    # ...

    return reward
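
The snippet above assumes `is_on_ladder` and `is_moving_down` exist; a minimal sketch of stub definitions that makes it runnable, assuming a dict-based state and string actions (purely illustrative):

```python
def is_on_ladder(state):
    # Stub: the state is assumed to expose a boolean flag
    return state.get("on_ladder", False)

def is_moving_down(action):
    # Stub: the action is assumed to be a plain string
    return action == "down"

# Usage
print(reward_function({"on_ladder": True}, "down"))  # 1
print(reward_function({"on_ladder": True}, "left"))  # -1
```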

Imperative function example

class Environment:
    def __init__(self):
        self.agent_position = (0, 0)   # Assume the agent starts at the top-left corner
        self.ladder_position = (5, 5)  # Assume the top of the ladder is at (5, 5)
        self.ladder_height = 5         # Assume the ladder is 5 cells tall (bottom at (5, 10))

    def get_agent_position(self):
        return self.agent_position

    def move_agent(self, direction):
        x, y = self.agent_position
        if direction == "down" and self.is_ladder_below():
            y += 1
        elif direction == "left":
            x -= 1
        elif direction == "right":
            x += 1
        elif direction == "up":
            y -= 1
        self.agent_position = (x, y)

    def is_ladder_below(self):
        # The agent may only move down while it is in the ladder's column
        # and has not yet passed the ladder's bottom cell
        x, y = self.agent_position
        lx, ly = self.ladder_position
        return x == lx and y < ly + self.ladder_height

    def is_on_ladder(self):
        x, y = self.agent_position
        lx, ly = self.ladder_position
        return x == lx and ly <= y <= ly + self.ladder_height

def climb_down_ladder(env):
    while not env.is_on_ladder():
        # Move towards the ladder
        agent_position = env.get_agent_position()
        ladder_position = env.ladder_position
        if agent_position[0] < ladder_position[0]:
            env.move_agent("right")
        elif agent_position[0] > ladder_position[0]:
            env.move_agent("left")
        elif agent_position[1] < ladder_position[1]:
            env.move_agent("down")

    # Now on the ladder, climb down
    for _ in range(env.ladder_height):  # descend the full height of the ladder
        env.move_agent("down")

        
def check_climb_down_complete(env):
    agent_position = env.get_agent_position()
    ladder_position = env.ladder_position
    # Check whether the agent has reached the bottom cell of the ladder
    if agent_position[1] >= ladder_position[1] + env.ladder_height and env.is_on_ladder():
        return True  # Climbing down the ladder is complete
    else:
        return False  # Climbing down the ladder is not complete
      
# Usage:
env = Environment()
climb_down_ladder(env)
print(check_climb_down_complete(env))  # expected: True