[TOC]

  1. Title: The Effect of Modeling Human Rationality Level on Learning Rewards from Multiple Feedback Types
  2. Author: Gaurav R. Ghosal et al.
  3. Publish Year: AAAI 2023 (arXiv v2: 9 Mar 2023)
  4. Review Date: Fri, May 10, 2024
  5. url: arXiv:2208.10687v2

## Summary of paper


## Contribution

## Some key terms

### What is the Boltzmann rationality coefficient $\beta$?

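For reference, the standard Boltzmann-rational choice model used in this line of work: a human with rationality level $\beta$, choosing among options $c \in \mathcal{C}$ whose true utility is $r(c)$, picks option $c$ with probability

$$
P(c) = \frac{\exp\big(\beta \, r(c)\big)}{\sum_{c' \in \mathcal{C}} \exp\big(\beta \, r(c')\big)}
$$

so $\beta = 0$ means the human chooses uniformly at random, and $\beta \to \infty$ means the human always picks the best option.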

### Applying the Boltzmann rationality coefficient in PPO

import torch
import torch.nn.functional as F

class PolicyNetwork(torch.nn.Module):
    def __init__(self, state_dim, action_dim, hidden_dim=64):  # illustrative sizes
        super().__init__()
        # simple two-layer MLP mapping a state to action logits
        self.net = torch.nn.Sequential(
            torch.nn.Linear(state_dim, hidden_dim),
            torch.nn.Tanh(),
            torch.nn.Linear(hidden_dim, action_dim),
        )

    def forward(self, state):
        # compute unnormalized action logits
        return self.net(state)

def select_action(logits, tau):
    # temperature-scaled softmax; the temperature tau plays the role of 1/beta:
    # large tau (small beta) -> near-uniform, small tau (large beta) -> near-greedy
    probabilities = F.softmax(logits / tau, dim=-1)
    action = torch.distributions.Categorical(probabilities).sample()
    return action

# During training, adjust tau and use it in the loss calculation
policy_network = PolicyNetwork(state_dim=4, action_dim=2)  # placeholder dimensions
state = torch.zeros(4)  # placeholder state
tau = 1.0               # could be a fixed value or decay over episodes
logits = policy_network(state)
action = select_action(logits, tau)



# In the real codebase (this method assumes `import torch as th` and
# `import torch.nn.functional as F` at module level):
def forward(self, x: th.Tensor, beta: th.Tensor = None) -> th.Tensor:
    x = self.linear(x)
    if beta is not None:
        # Human Rationality Level beta parameter: beta = 0 treats the feedback
        # as uniformly random noise; larger beta treats it as closer to
        # perfectly rational (optimal) feedback
        x = x * beta
    logits = F.log_softmax(x, dim=-1)
    return logits
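To see how this is used, here is a minimal, self-contained sketch of reward learning from a single piece of human feedback (my own reconstruction, not the authors' code; `BoltzmannChoiceModel`, the feature sizes, and the optimizer settings are illustrative assumptions). The $\beta$-scaled log-softmax is exactly the Boltzmann log-likelihood of each option, and the model is trained by maximizing the likelihood of the option the human actually chose.

import torch as th
import torch.nn.functional as F

class BoltzmannChoiceModel(th.nn.Module):
    """Scores each option and returns Boltzmann log-probabilities under an assumed beta."""
    def __init__(self, in_dim: int, num_options: int):
        super().__init__()
        self.linear = th.nn.Linear(in_dim, num_options)

    def forward(self, x: th.Tensor, beta: th.Tensor = None) -> th.Tensor:
        x = self.linear(x)      # per-option scores (learned reward estimates)
        if beta is not None:
            x = x * beta        # scale by the assumed human rationality level
        return F.log_softmax(x, dim=-1)

model = BoltzmannChoiceModel(in_dim=8, num_options=4)
optimizer = th.optim.Adam(model.parameters(), lr=1e-3)

features = th.randn(8)       # toy features describing the query shown to the human
human_choice = th.tensor(2)  # index of the option the human picked
beta = th.tensor(0.5)        # assumed rationality level (deliberately conservative)

log_probs = model(features, beta=beta)
loss = -log_probs[human_choice]  # negative Boltzmann log-likelihood of the observed choice
loss.backward()
optimizer.step()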

## Results

1. Remark: underestimating $\beta$ is better than overestimating it.

The authors provide proofs of this (Proposition 1 and Proposition 3 in the paper); a toy numeric illustration follows below.

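A toy numeric illustration of why this holds (my own example, not one of the paper's experiments): the learner does a Bayesian update over a binary reward hypothesis using the Boltzmann likelihood, but plugs in its assumed $\beta$. When the noisy human happens to mislabel most of a small batch of queries, an overestimated $\beta$ makes the learner confidently wrong, while an underestimated $\beta$ keeps the posterior close to uniform, which is the safer failure mode.

import numpy as np

def posterior_correct(labels, beta_assumed, r_correct=1.0, r_wrong=-1.0):
    """P(reward = r_correct | labels) under a uniform prior, modeling each binary
    label y (1 = 'good') as Bernoulli(sigmoid(beta_assumed * r))."""
    def likelihood(r):
        p_good = 1.0 / (1.0 + np.exp(-beta_assumed * r))
        return np.prod([p_good if y == 1 else 1.0 - p_good for y in labels])
    l_correct, l_wrong = likelihood(r_correct), likelihood(r_wrong)
    return l_correct / (l_correct + l_wrong)

# The true human is noisy (beta_true around 0.5) and here mislabels 2 of 3 queries.
labels = [1, 0, 0]

for beta_assumed in [0.1, 0.5, 5.0]:  # under-estimated, roughly correct, over-estimated
    print(f"assumed beta = {beta_assumed}: P(correct reward) ~ "
          f"{posterior_correct(labels, beta_assumed):.3f}")

# under-estimated beta -> posterior stays near 0.5 (uncertain, but not badly wrong)
# over-estimated beta  -> posterior puts almost all of its mass on the wrong hypothesis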

## Potential future work

What it does is straightforward: if you do not trust the reward signal (low $\beta$), you let the policy explore more.
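As a quick sanity check of that intuition (toy logits, nothing from the paper): scaling the logits by a small $\beta$ flattens the action distribution toward uniform (maximum exploration), while a large $\beta$ makes it nearly greedy.

import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.0])
for beta in [0.0, 0.5, 1.0, 5.0]:
    # beta = 0 -> uniform over actions (pure exploration); large beta -> almost deterministic
    print(beta, F.softmax(beta * logits, dim=-1))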