
  1. Title: The Effect of Modeling Human Rationality Level on Learning Rewards from Multiple Feedback Types
  2. Author: Gaurav R. Ghosal et al.
  3. Publish Year: 9 Mar 2023, AAAI 2023
  4. Review Date: Fri, May 10, 2024
  5. url: arXiv:2208.10687v2

Summary of paper

image-20240510211346583

Contribution

  • We find that overestimating human rationality can have dire effects on reward learning accuracy and regret
  • We also find that fitting the rationality coefficient to human data enables better reward learning, even when the human deviates significantly from the noisy-rational choice model due to systematic biases

Some key terms

What is the Boltzmann rationality coefficient $\beta$

image-20240510211612716
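
For reference, the Boltzmann (noisy-rational) choice model is commonly written as below. This is a sketch in assumed notation rather than a copy of the figure: $C$ is the set of options the human can choose from, $c$ is the chosen option, and $r_\theta$ is the reward function with parameters $\theta$.

$$
P(c \mid \theta, \beta) = \frac{\exp\big(\beta \, r_\theta(c)\big)}{\sum_{c' \in C} \exp\big(\beta \, r_\theta(c')\big)}
$$

With $\beta = 0$ every option is equally likely (the human is modeled as choosing at random); as $\beta \to \infty$ the human is modeled as always picking the highest-reward option.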

Applying the Boltzmann rationality coefficient in PPO

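import torch
import torch.nn.functional as F

class PolicyNetwork(torch.nn.Module):
    # state_dim and num_actions are illustrative; the original note elides the constructor arguments
    def __init__(self, state_dim: int, num_actions: int):
        super().__init__()
        # placeholder for the real network layers
        self.linear = torch.nn.Linear(state_dim, num_actions)

    def forward(self, state):
        # compute unnormalized action scores (logits)
        logits = self.linear(state)
        return logits

def select_action(logits, tau):
    # temperature-scaled softmax: tau plays the role of 1 / beta
    probabilities = F.softmax(logits / tau, dim=-1)
    action = torch.distributions.Categorical(probabilities).sample()
    return action

# During training, adjust tau and use it in the loss calculation
policy_network = PolicyNetwork(state_dim=4, num_actions=2)  # example sizes only
state = torch.randn(1, 4)  # dummy state for illustration
initial_tau = 1.0
tau = initial_tau  # This could be a fixed value or decay over episodes
logits = policy_network(state)
action = select_action(logits, tau)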



# In the real codebase (torch is imported as th there)
def forward(self, x: th.Tensor, beta: th.Tensor = None) -> th.Tensor:
    x = self.linear(x)
    if beta is not None:
        # Human rationality level beta: 0 makes the output uniform (the reward
        # signal is treated as random); 1 leaves the scores unscaled (the reward
        # signal is treated as perfect)
        x = x * beta
    # log_softmax returns log-probabilities, not raw logits
    log_probs = F.log_softmax(x, dim=-1)
    return log_probs
  • Essentially, when $\beta$ is low, the action distribution is flatter, so the policy explores more.
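
To make the effect of $\beta$ concrete, here is a minimal standard-library sketch (not taken from the paper or the codebase above). The helper names `boltzmann_probs` and `entropy` and the example scores are assumptions for illustration. It scales a fixed set of scores by $\beta$ before the softmax and prints the entropy of the resulting distribution.

```python
import math

def boltzmann_probs(scores, beta):
    """Softmax over beta-scaled scores (hypothetical helper, for illustration)."""
    scaled = [beta * s for s in scores]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; higher means a flatter distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

scores = [1.0, 2.0, 3.0]  # made-up action scores
for beta in (0.0, 0.5, 1.0, 5.0):
    p = boltzmann_probs(scores, beta)
    print(f"beta={beta}: probs={[round(x, 3) for x in p]}, entropy={entropy(p):.3f}")
```

At `beta = 0` the distribution is uniform, and the entropy shrinks as `beta` grows, which matches the "low beta means more exploration" reading.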

Results

1. Remark: underestimating $\beta$ is better than overestimating it

The authors provide proofs (Proposition 1 and Proposition 3).

image-20240510214340596 image-20240510214808544
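
The intuition behind this remark can be illustrated with a toy Bayesian update. This is a sketch under stated assumptions, not the paper's proof or experimental setup. Assume two options A and B and two reward hypotheses with a uniform prior: the true one gives A a reward 1 higher than B, the false one gives B a reward 1 higher than A. The human makes a few noisy "mistakes" by choosing B, and the learner updates with an assumed $\beta$. The function name `posterior_true_hypothesis` is hypothetical.

```python
import math

def posterior_true_hypothesis(assumed_beta, n_mistakes=1):
    """Posterior on the true hypothesis after the human picks B n_mistakes times.

    The learner uses the Boltzmann model with assumed_beta and a uniform prior.
    """
    # P(B | true hypothesis): B has the lower reward, gap of 1
    p_b_true = 1.0 / (1.0 + math.exp(assumed_beta))
    # P(B | false hypothesis): B has the higher reward, gap of 1
    p_b_false = 1.0 / (1.0 + math.exp(-assumed_beta))
    like_true = p_b_true ** n_mistakes
    like_false = p_b_false ** n_mistakes
    return like_true / (like_true + like_false)

for beta in (0.0, 0.5, 2.0, 10.0):
    post = posterior_true_hypothesis(beta)
    print(f"assumed beta={beta}: posterior on true hypothesis = {post:.6f}")
```

With a small assumed `beta` a single noisy choice barely moves the posterior, so the learner stays uncertain and can recover with more data. With a large assumed `beta` the same choice almost eliminates the true hypothesis, which is one way to see why overestimating rationality is the costlier error.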

Potential future work

What the paper does is straightforward: if you do not trust the reward signal, you let the policy explore more.