[TOC]
- Title: The Wisdom of Hindsight Makes Language Models Better Instruction Followers
- Author: Tianjun Zhang et al.
- Publish Year: 10 Feb 2023
- Review Date: Thu, Mar 2, 2023
- url: https://arxiv.org/pdf/2302.05206.pdf
Summary of paper
Motivation
- Reinforcement Learning with Human Feedback (RLHF) demonstrates impressive performance on the GPT series models. However, the underlying RL algorithm is complex and requires an additional training pipeline for reward and value networks.
Contribution
- in this paper, we consider an alternative approach: converting feedback to instruction by relabeling the original one and training the model for better alignment in a supervised manner.
- Such an algorithm doesn’t require any additional parameters except for the original language model and maximally reuses the pretraining pipeline.
- To achieve this, we formulate the instruction alignment problem as decision making. We propose Hindsight Instruction Relabeling (HIR), a novel algorithm for aligning language models with instructions.
- The resulting two-stage algorithm sheds light on a family of reward-free approaches that utilise hindsight-relabeled instructions based on feedback.
Some key terms
fine-tuning language model
- the most widely adopted approach is to deploy reinforcement learning (RL) algorithms to optimize for a manually defined or learned “alignment score”.
- Impressive progress has been made in this direction, including the recently released GPT series models (OpenAI, 2022)
- however, RL fine-tuning is less data-efficient: it only makes use of the successful instruction-output pairs, completely abandoning the ones that do not align.
Hindsight Instruction Relabeling (HIR)
- adopts the central idea of relabeling the instructions in a hindsight fashion based on the generated outputs of the language model.
- HIR alternates between two phases (a minimal sketch of this loop follows the list below):
- an online sampling phase to generate a dataset of instruction-output pairs,
- along with an offline learning phase that relabels the instructions of each pair and performs standard supervised learning
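A minimal Python sketch of this alternating loop, based only on the description above; `sample_fn`, `relabel_fn`, and `finetune_fn` are hypothetical callables standing in for the sampling, relabeling, and supervised fine-tuning steps, not the authors' actual code:

```python
# Hypothetical sketch of the HIR outer loop (not the authors' implementation).
def hindsight_instruction_relabeling(model, prompts, queries,
                                     sample_fn, relabel_fn, finetune_fn,
                                     num_iterations=3):
    for _ in range(num_iterations):
        # Online sampling phase: use the current model to build a dataset of
        # (instruction, query, output) triples.
        dataset = [(p, q, sample_fn(model, p, q)) for p, q in zip(prompts, queries)]

        # Offline learning phase: relabel each instruction in hindsight so it
        # matches the output that was actually produced, then run standard
        # supervised fine-tuning on the relabeled pairs.
        relabeled = [(relabel_fn(p, q, o), q, o) for p, q, o in dataset]
        model = finetune_fn(model, relabeled)
    return model
```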
Offline Relabeling
- The key component of our algorithm is the offline relabeling part. In this part, for every instruction-output pair $(p, q, o)$ that is not necessarily aligned:
- $p$ is the instructional prompt (an element of the prompt space $\mathcal P$)
- $q$ is the input token sequence used as the query (an element of the state space $\mathcal Q$)
- $o$ is the output sequence (actions)
- we relabel this pair with a new instruction that can align with the outcome of the model: $(p^*, q, o)$
- The new instruction $p^*$ is generated based on the feedback function $\mathcal R(p,q,o)$ and the instruction generation function $\phi(p,q,o,r)$, which can either be learned or scripted.
- EXAMPLE
- in the framework of RLHF, if the learned reward model $\mathcal R(p,q,o)$ generates a score that ranks at about the 75% level of the training data, we can give the model an additional scripted instruction such as “give me an answer that ranks about 75% in the training data” (see the sketch below).
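A minimal sketch of how such scripted relabeling could look; `reward_fn` stands in for the feedback function $\mathcal R(p,q,o)$, `training_rewards` is an assumed list of reference scores used to compute percentile ranks, and the instruction template is hypothetical:

```python
import numpy as np

def scripted_phi(p, q, o, r, percentile):
    # Scripted instruction-generation function phi(p, q, o, r): here it only
    # uses the percentile rank of the reward to describe what the model did.
    return f"give me an answer that ranks about {int(round(percentile * 100))}% in the training data"

def relabel_dataset(dataset, reward_fn, training_rewards):
    # dataset: list of (p, q, o) triples that are not necessarily aligned.
    # reward_fn(p, q, o): feedback function; training_rewards: reference scores
    # against which each output's percentile rank is computed (an assumption).
    relabeled = []
    for p, q, o in dataset:
        r = reward_fn(p, q, o)
        percentile = float(np.mean(np.asarray(training_rewards) <= r))
        p_star = scripted_phi(p, q, o, r, percentile)  # new instruction p*
        relabeled.append((p_star, q, o))               # hindsight-relabeled pair
    return relabeled
```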
Conceptual Comparison between HIR and baseline methods