[TOC]

  1. Title: The Wisdom of Hindsight Makes Language Models Better Instruction Followers
  2. Author: Tianjun Zhang et al.
  3. Publish Date: 10 Feb 2023
  4. Review Date: Thu, Mar 2, 2023
  5. url: https://arxiv.org/pdf/2302.05206.pdf

Summary of paper


Motivation

  • Reinforcement Learning from Human Feedback (RLHF) demonstrates impressive performance on the GPT series models. However, the RLHF pipeline requires training separate reward and value networks on top of the language model, which makes the pipeline complex and adds extra parameters.

Contribution

  • In this paper, we consider an alternative approach: converting feedback into instructions by relabeling the original instruction, and training the model for better alignment in a supervised manner.
  • Such an algorithm doesn’t require any additional parameters beyond the original language model and maximally reuses the pretraining pipeline.
  • To achieve this, we formulate the instruction alignment problem as a decision-making problem. We propose Hindsight Instruction Relabeling (HIR), a novel algorithm for aligning language models with instructions.
  • The resulting two-stage algorithm sheds light on a family of reward-free approaches that utilise instructions relabeled in hindsight based on feedback.

Some key terms

fine-tuning language model

  • The most widely adopted approach is to deploy reinforcement learning (RL) algorithms to optimize a manually defined or learned “alignment score” (see the objective sketched below this list).
  • Impressive progress has been made in this direction, including the recently released GPT series models (OpenAI, 2022).
  • However, RL fine-tuning is less data-efficient if it only makes use of the successful instruction-output pairs, completely abandoning the ones that do not align.
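Concretely, RL-based fine-tuning maximizes the expected alignment score over the model's own outputs. A rough way to write this objective, borrowing the $(p, q, o)$ notation used in the offline-relabeling section below (practical RLHF typically also adds a KL penalty toward the pretrained model), is:

$$\max_{\theta} \; \mathbb{E}_{(p,q) \sim \mathcal{D},\; o \sim \pi_{\theta}(\cdot \mid p, q)} \left[ \mathcal{R}(p, q, o) \right]$$

where $\pi_\theta$ is the language model being fine-tuned and $\mathcal R$ is the (learned or scripted) alignment score.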

Hindsight Instruction Relabeling (HIR)

  • adopts the central idea of relabeling the instructions in a hindsight fashion based on the generated outputs of the language model.
  • HIR alternates between two phases (a minimal sketch of this loop follows this list):
    • an online sampling phase, which uses the current model to generate a dataset of instruction-output pairs,
    • along with an offline learning phase, which relabels the instructions of each pair and performs standard supervised learning.
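A minimal sketch of the alternation between the two phases, in Python. The helper names (`sample_output`, `feedback`, `relabel_instruction`, `supervised_update`) and the loop hyperparameters are assumptions for illustration, not the paper's actual implementation:

```python
import random
from typing import Callable, List, Tuple

# One (instruction p, query q, output o) triple, following the paper's notation.
Pair = Tuple[str, str, str]

def hir_train(
    model,                                   # language model being aligned
    prompts: List[Tuple[str, str]],          # pool of (instruction p, query q) pairs
    sample_output: Callable,                 # o = sample_output(model, p, q): online sampling
    feedback: Callable,                      # r = feedback(p, q, o): the feedback function R
    relabel_instruction: Callable,           # p_star = relabel_instruction(p, q, o, r): phi
    supervised_update: Callable,             # one supervised fine-tuning step on (p*, q, o)
    num_iterations: int = 3,
    samples_per_iteration: int = 64,
):
    for _ in range(num_iterations):
        # --- online sampling phase: generate a dataset of instruction-output pairs ---
        dataset: List[Pair] = []
        batch = random.sample(prompts, min(samples_per_iteration, len(prompts)))
        for p, q in batch:
            o = sample_output(model, p, q)
            dataset.append((p, q, o))

        # --- offline learning phase: hindsight relabeling + standard supervised learning ---
        for p, q, o in dataset:
            r = feedback(p, q, o)                      # score the actual outcome
            p_star = relabel_instruction(p, q, o, r)   # new instruction consistent with o
            supervised_update(model, p_star, q, o)     # train on the relabeled pair

    return model
```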

Offline Relabeling

  • The key component of our algorithm is the offline relabeling part. In this part, for every instruction-output pair $(p, q, o)$ that is not necessarily aligned:
    • $p$ is the instructional prompt,
    • $q$ is the input token sequence, used as the query,
    • $o$ is the output sequence (the actions).
  • We relabel the pair as $(p^*, q, o)$, where the new instruction $p^*$ aligns with the outcome the model actually produced.
  • The new instruction $p^*$ is generated based on the feedback function $\mathcal R(p, q, o)$ and the instruction generation function $\phi(p, q, o, r)$, which can either be learned or scripted.
  • EXAMPLE (a sketch of such a scripted relabeling function follows this list)
    • In the framework of RLHF, if the learned reward model $\mathcal R(p, q, o)$ produces a score that ranks at about the 75th percentile of the training data, we can give the model a scripted instruction such as “give me an answer that ranks at about the 75th percentile of the training data”.
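A minimal sketch of what such a scripted $\phi(p, q, o, r)$ could look like, continuing the percentile example above. The thresholds and instruction templates are illustrative assumptions, not taken from the paper:

```python
def relabel_instruction(p: str, q: str, o: str, r: float) -> str:
    """Map a feedback score r in [0, 1] (e.g. a percentile rank from the
    feedback function R(p, q, o)) to a new instruction p* that is consistent
    with the output the model actually produced."""
    if r >= 0.9:
        return "Give me an answer that ranks in the top 10% of the training data."
    if r >= 0.75:
        return "Give me an answer that ranks at about the 75th percentile of the training data."
    if r >= 0.5:
        return "Give me an answer of roughly average quality for the training data."
    return "Give me a low-quality answer."

# The relabeled pair (p*, q, o) then goes into standard supervised fine-tuning,
# so the instruction always matches what the model actually generated.
p_star = relabel_instruction(
    p="Give me the best possible answer.",
    q="What causes tides?",
    o="Tides are mainly caused by the Moon's gravitational pull.",
    r=0.78,
)
```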

Conceptual Comparison between HIR and baseline methods

(Figure from the paper: conceptual comparison between HIR and the baseline methods.)