[TOC]
- Title: How to talk so AI will learn: Instructions, descriptions, and autonomy
- Author: Theodore R. Sumers et al.
- Publish Year: NeurIPS 2022
- Review Date: Wed, Mar 15, 2023
- url: https://arxiv.org/pdf/2206.07870.pdf
Summary of paper
Motivation
- yet today, we lack computational models explaining such language use
Contribution
- To address this challenge, we formalise learning from language in a contextual bandit setting and ask how a human might communicate preferences over behaviours.
- (i.e., recover the speaker's intent (their preferences) from how they present it (language about behaviour))
- we show that instructions are better in low-autonomy settings, but descriptions are better when the agent will need to act independently.
- We then define a pragmatic listener agent that robustly infers the speaker’s reward function by reasoning about how the speaker expresses themselves. (language reward module?)
- we hope these insights facilitate a shift from developing agents that obey language to agents that learn from it.
Some key terms
two distinct types of language
- instructions
- which provide information about the desired policy
- descriptions
- which provide information about the reward function
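To make the distinction concrete, here is a minimal Python sketch of how a literal listener might represent the two utterance types: an instruction pins down the action for one context, while a description constrains a weight in the reward function. This is an illustration under assumed representations, not the paper's implementation; all names are invented.

```python
from dataclasses import dataclass

# Hypothetical, simplified utterance representations (not the paper's code).

@dataclass
class Instruction:
    """'Pick the spotted mushroom': names the desired action in one context."""
    context_id: int
    action: str          # e.g. "spotted"

@dataclass
class Description:
    """'Red mushrooms are toxic': conveys the sign of a reward feature."""
    feature: str         # e.g. "red"
    weight: float        # e.g. -1.0 for toxic, +1.0 for tasty

def literal_update(belief: dict, utterance) -> dict:
    """Update a literal listener's belief state.

    belief = {"policy": {context_id: action}, "reward_weights": {feature: weight}}
    """
    if isinstance(utterance, Instruction):
        # Instructions provide a partial policy: what to do in this one context.
        belief["policy"][utterance.context_id] = utterance.action
    elif isinstance(utterance, Description):
        # Descriptions provide partial information about the reward function,
        # which transfers to any context containing this feature.
        belief["reward_weights"][utterance.feature] = utterance.weight
    return belief

belief = {"policy": {}, "reward_weights": {}}
belief = literal_update(belief, Instruction(context_id=0, action="spotted"))
belief = literal_update(belief, Description(feature="red", weight=-1.0))
print(belief)
```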
Issues
- prior work has highlighted the difficulty of specifying our desires via numerical reward functions
- here, we explore language as a means to communicate them
- while most previous work on language input to AI systems focuses on instructions, we study instructions alongside more abstract, descriptive language
- from instructions to descriptions
how we learn reward functions from others
- inverse reinforcement learning
However, natural language affords richer, under-explored forms of teaching
- for example, an expert teaching a seminar might instead describe how to recognise edible or toxic mushrooms based on their features, thus providing highly generalised information
- (i.e., beyond bare instructions, the language also conveys the rationale: generalisable, feature-level information; a toy contrast is sketched below)
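A hedged toy contrast of the two teaching channels mentioned here (all numbers, features, and the learning rule are invented for illustration): an inverse-RL-style learner must infer feature weights from which mushrooms the expert picks, whereas a description hands over the feature-level rule directly.

```python
import numpy as np

rng = np.random.default_rng(0)
FEATURES = ["red", "green", "spotted", "striped"]
TRUE_W = np.array([-1.0, +1.0, +1.0, -1.0])   # hidden reward weights (toy)

def sample_context(n_options=3):
    """Each option is a binary feature vector for one mushroom."""
    return rng.integers(0, 2, size=(n_options, len(FEATURES)))

# --- Channel 1: demonstrations (inverse-RL flavour) -----------------------
# The expert repeatedly picks the best mushroom; the learner estimates weights
# from chosen-vs-unchosen feature differences (a crude perceptron-style update).
w_demo = np.zeros(len(FEATURES))
for _ in range(20):
    ctx = sample_context()
    choice = int(np.argmax(ctx @ TRUE_W))
    for j in range(len(ctx)):
        if j != choice:
            w_demo += 0.1 * (ctx[choice] - ctx[j])

# --- Channel 2: a description ---------------------------------------------
# "Green and spotted mushrooms are tasty; red and striped ones are toxic."
# The learner can read the weights off directly, no demonstrations needed.
w_desc = np.array([-1.0, +1.0, +1.0, -1.0])

print("estimated from demonstrations:", np.round(w_demo, 2))
print("read from the description:   ", w_desc)
```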
discovery
- The horizon quantifies notions of autonomy described in previous work. If the horizon is short, the agent is closely supervised; at longer horizons, it is expected to act more independently. We then analyse instructions (which provide a partial policy) and descriptions (which provide partial information about the reward function). We show that instructions are optimal at short horizons, while descriptions are optimal at longer ones (see the simulation sketch below).
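The horizon claim can be illustrated with a small simulation, a sketch under invented assumptions rather than the paper's experiments: the instruction-following agent acts optimally in the first context but falls back to guessing afterwards, while the description-informed agent has partial reward knowledge that transfers to every context. As H grows, the description wins on average.

```python
import numpy as np

rng = np.random.default_rng(1)
N_FEATURES, N_OPTIONS = 6, 3
TRUE_W = rng.normal(size=N_FEATURES)               # hidden reward weights (toy)

def sample_context():
    return rng.normal(size=(N_OPTIONS, N_FEATURES))  # option feature vectors

def episode(horizon, use_description):
    """One utterance, then `horizon` contexts with no further supervision."""
    # Description: speaker reveals only the two largest-magnitude reward weights
    # (partial reward information). Instruction: speaker names the optimal
    # action for the first context only (partial policy).
    known = np.zeros(N_FEATURES)
    if use_description:
        top = np.argsort(np.abs(TRUE_W))[-2:]
        known[top] = TRUE_W[top]
    total = 0.0
    for t in range(horizon):
        ctx = sample_context()
        if use_description:
            act = int(np.argmax(ctx @ known))        # exploit partial reward info
        else:
            # Optimal at t=0 (the instructed action), random guessing afterwards.
            act = int(np.argmax(ctx @ TRUE_W)) if t == 0 else int(rng.integers(N_OPTIONS))
        total += float(ctx[act] @ TRUE_W)
    return total

for H in [1, 2, 5, 10]:
    instr = np.mean([episode(H, use_description=False) for _ in range(2000)])
    desc = np.mean([episode(H, use_description=True) for _ in range(2000)])
    print(f"H={H:2d}  instruction: {instr:6.2f}   description: {desc:6.2f}")
```

With these toy assumptions the instruction agent is best at H = 1 and its return plateaus, while the description agent's return keeps growing with the horizon.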
Limitations
- In practice, it is difficult to specify a reward function to obtain desired behaviour, motivating learning the reward function from social input.
- most social learning methods assume the expert is simply acting optimally, but recent pragmatic methods instead assume the expert is actively teaching.
Learning reward functions from descriptions
- Rather than expressing specific goals, reward-descriptive language encodes abstract information about preferences or the world.
- Other related lines of work use language which describes agent behaviours.
- However, a smaller body of work uses it for RL: learning reward functions directly from existing bodies of text or from interactive, free-form language input.
- our work provides a formal model of such language in order to compare it with more typically studied instructions.
Speaker model is essential
- we show how a carefully specified speaker model, incorporating both instructions and descriptions, allows our pragmatic listener to robustly infer a speaker’s reward function.
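A minimal RSA-style sketch of the idea (assumed details: a discrete grid of candidate reward weights and a softmax speaker who prefers utterances that are more useful under the true reward; this is not the paper's exact speaker model): the pragmatic listener does Bayesian inference over reward functions by asking which reward would have made the speaker say this.

```python
import itertools
import numpy as np

FEATURES = ["red", "green", "spotted"]
# Candidate reward functions: each feature weight is -1, 0, or +1.
CANDIDATES = [np.array(w) for w in itertools.product([-1, 0, 1], repeat=3)]
UTTERANCES = [(i, s) for i in range(3) for s in (-1, +1)]  # "feature i is bad/good"

def utterance_utility(utt, w):
    """Toy utility: an utterance is useful when it matches the sign of a weight."""
    i, s = utt
    return float(s * w[i])

def speaker_probs(w, alpha=3.0):
    """Softmax speaker: prefers utterances that are more useful under w."""
    u = np.array([utterance_utility(utt, w) for utt in UTTERANCES])
    p = np.exp(alpha * (u - u.max()))
    return p / p.sum()

def pragmatic_listener(utt, prior=None):
    """Posterior over candidate reward functions given one utterance."""
    prior = prior if prior is not None else np.ones(len(CANDIDATES)) / len(CANDIDATES)
    lik = np.array([speaker_probs(w)[UTTERANCES.index(utt)] for w in CANDIDATES])
    post = prior * lik
    return post / post.sum()

# Hearing "red is bad" should concentrate mass on rewards with a negative red weight.
post = pragmatic_listener((0, -1))
mass_red_bad = sum(p for p, w in zip(post, CANDIDATES) if w[0] == -1)
print(f"P(red has negative weight | 'red is bad') = {mass_red_bad:.2f}")
```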
Reward assumption
- we assume rewards are a linear function of these features (e.g., green mushrooms tend to be tasty)
- the paper does not discuss the perturbed-reward problem
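A one-line illustration of the linear-reward assumption (feature names and weights invented): an option's reward is the dot product of its feature vector with a weight vector, so a positive weight on "green" encodes "green mushrooms tend to be tasty".

```python
import numpy as np

FEATURES = ["red", "green", "spotted"]
w = np.array([-1.0, +0.8, +0.5])   # assumed weights: red bad, green/spotted good

def reward(option_features: dict) -> float:
    """Linear reward: dot product of the option's binary features with w."""
    phi = np.array([float(option_features.get(f, 0)) for f in FEATURES])
    return float(phi @ w)

print(reward({"green": 1, "spotted": 1}))  # 1.3: a tasty-looking mushroom
print(reward({"red": 1}))                  # -1.0: probably toxic
```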
Formalising learning from language in contextual bandits
- H = 1 means there is a single decision rather than an action sequence, so the speaker can supervise that one action directly
- for H > 1, single-step action supervision is no longer available: after the single utterance, the agent must act across multiple contexts on its own (protocol sketched below)
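A sketch of the interaction protocol as this section describes it (the environment, variable names, and the placeholder agent are assumptions for illustration): the speaker sends one utterance up front, then the agent faces H contexts and chooses actions with no further supervision.

```python
import numpy as np

rng = np.random.default_rng(2)
N_FEATURES, N_OPTIONS = 4, 3

def sample_context():
    """A context is a set of options, each described by a feature vector."""
    return rng.normal(size=(N_OPTIONS, N_FEATURES))

def run_episode(agent_policy, utterance, horizon):
    """One utterance at t=0, then `horizon` contexts with no further supervision.
    horizon=1 is the closely supervised case; horizon>1 requires autonomy."""
    true_w = rng.normal(size=N_FEATURES)          # hidden reward weights, unseen by the agent
    rewards = []
    for t in range(horizon):
        ctx = sample_context()
        action = agent_policy(ctx, utterance, t)  # the agent only sees the utterance
        rewards.append(float(ctx[action] @ true_w))
    return rewards

# A trivial placeholder agent that ignores the utterance and guesses.
random_agent = lambda ctx, utterance, t: int(rng.integers(len(ctx)))
print(run_episode(random_agent, utterance="green ones are tasty", horizon=3))
```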
difference between descriptions and instructions
- instructions map to actions (a partial policy)
- descriptions map to reward signals (partial information about the reward function)
what is the point
- as the horizon lengthens, however, descriptions generalise better, allowing the agent to act well in unseen contexts using the reward-level (rationale) information