[TOC]

  1. Title: Improving language models by retrieving from trillions of tokens
  2. Author: Sebastian Borgeaud et al.
  3. Publish Year: Feb 2022
  4. Review Date: Mar 2022

Summary of paper

Motivation

in order to decrease the size of language models, this work suggests retrieval from a large text database as a complementary path to scaling language models.

they equip models with the ability to directly access a large dataset to perform prediction – a semi-parametric approach.

how do they do it

first, construct a key-value database, where values store raw chunks of text tokens and keys are frozen BERT embeddings.

  • then they use a frozen model (not trainable) to avoid having to periodically re-compute embeddings over the entire database during training.
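
a minimal sketch of this construction, assuming a toy corpus; a fixed random projection stands in for the frozen BERT encoder so the example runs without model weights, and only the 64-token chunk size comes from the paper:

```python
import numpy as np

CHUNK_SIZE = 64  # tokens per chunk (the paper splits text into 64-token chunks)


def frozen_bert_embed(text: str) -> np.ndarray:
    """Stand-in for a frozen (non-trainable) BERT encoder: a fixed random
    projection of a bag-of-bytes vector, so the sketch needs no weights."""
    rng = np.random.default_rng(0)           # fixed seed, so the "model" never changes
    proj = rng.standard_normal((256, 128))
    counts = np.zeros(256)
    for b in text.encode("utf-8"):
        counts[b] += 1.0
    v = counts @ proj
    return v / (np.linalg.norm(v) + 1e-8)    # unit-normalise for cosine similarity


def build_database(corpus_tokens: list[str]):
    """Split the corpus into fixed-size chunks; the raw chunk is the value,
    its frozen embedding is the key."""
    keys, values = [], []
    for i in range(0, len(corpus_tokens), CHUNK_SIZE):
        chunk = " ".join(corpus_tokens[i:i + CHUNK_SIZE])
        keys.append(frozen_bert_embed(chunk))   # key: frozen embedding
        values.append(chunk)                    # value: raw text chunk
    return np.stack(keys), values               # keys: (n_chunks, 128)
```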

second, each training sequence input is split into chunks, which are augmented with their k-nearest neighbours retrieved from the database (see the sketch after the example below).

  • e.g., (figure in the original notes: a training sequence split into chunks, each chunk paired with its retrieved neighbours)
  • so a chunk C1 will have several value neighbours from the database
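
to make the chunking and retrieval step concrete, here is a sketch that reuses `frozen_bert_embed` and the `(keys, values)` pair from the database sketch above; the brute-force cosine search and `k=2` are simplifications of the paper's approximate nearest-neighbour search over a trillion-token database:

```python
import numpy as np


def retrieve_neighbours(sequence_tokens, embed, keys, values,
                        chunk_size=64, k=2):
    """Split the input sequence into chunks and, for each chunk, return the k
    nearest database values under cosine similarity of frozen embeddings."""
    augmented = []
    for i in range(0, len(sequence_tokens), chunk_size):
        chunk = " ".join(sequence_tokens[i:i + chunk_size])
        q = embed(chunk)                  # query key for this chunk
        sims = keys @ q                   # cosine similarity (keys are unit-normalised)
        top = np.argsort(-sims)[:k]       # indices of the k nearest keys
        augmented.append((chunk, [values[j] for j in top]))
    return augmented                      # [(chunk, [neighbour_1, ..., neighbour_k]), ...]


# usage with the toy database above:
# keys, values = build_database(corpus_tokens)
# augmented = retrieve_neighbours(train_tokens, frozen_bert_embed, keys, values)
```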

finally, an encoder-decoder architecture integrates the retrieved chunks into the model's predictions (a minimal CCA sketch follows the notes below).

(figure in the original notes: the RETRO encoder-decoder architecture with chunked cross-attention)

  • check the CCA (chunked cross-attention) architecture: it preserves autoregressivity, i.e. a later token depends only on previous tokens.
  • FFW is the feed-forward (fully connected) layer
  • ATTN is the self-attention module
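
a highly simplified numpy sketch of the CCA step: here tokens in chunk u attend only to the encoded neighbours retrieved for chunk u-1, which cannot leak future information. the plain softmax attention, the shapes, and this "previous chunk only" rule are my simplifications; the real RETRO layer uses a shifted attention pattern with learned projections, with CCA sitting between the ATTN and FFW sub-layers:

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def chunked_cross_attention(h, neighbour_enc, chunk_size):
    """h: (seq, d) decoder hidden states; neighbour_enc: one (r, d) array of
    encoded retrieved neighbours per chunk. Tokens in chunk u attend only to
    neighbours retrieved for chunk u-1, so autoregressivity is preserved."""
    out = h.copy()
    d = h.shape[1]
    n_chunks = h.shape[0] // chunk_size
    for u in range(1, n_chunks):                  # chunk 0 has no earlier retrieval
        sl = slice(u * chunk_size, (u + 1) * chunk_size)
        q = h[sl]                                 # queries: current chunk's tokens
        kv = neighbour_enc[u - 1]                 # keys/values: previous chunk's neighbours
        att = softmax(q @ kv.T / np.sqrt(d))      # (chunk_size, r) attention weights
        out[sl] = q + att @ kv                    # residual cross-attention update
    return out
```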

Algorithm

(figure in the original notes: the algorithm from the paper)

Some key terms

retrieval database

a key-value database

  • where values store raw chunks of text tokens and keys are frozen BERT embeddings; query chunks are embedded the same way and matched against these keys
  • two main approaches to querying are matching words in the query against the database index (keyword searching) and traversing the database using hypertext or hypermedia links (a toy contrast is sketched after this list)
  • in this work, each value is a raw chunk of informative text (the retrieved passage together with its continuation in the source document)
  • (figures in the original notes: example database keys and values)
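
a toy contrast between the two query styles mentioned above: keyword search against an inverted index versus the embedding nearest-neighbour lookup sketched earlier; the index structure and example strings are illustrative only:

```python
from collections import defaultdict


def build_inverted_index(values):
    """Map each word to the ids of the database values that contain it."""
    index = defaultdict(set)
    for i, text in enumerate(values):
        for word in text.lower().split():
            index[word].add(i)
    return index


def keyword_search(query, index):
    """Return ids of the values containing every word of the query."""
    hits = [index.get(w, set()) for w in query.lower().split()]
    return set.intersection(*hits) if hits else set()


values = ["retrieval augments language models with external text",
          "frozen BERT embeddings serve as database keys"]
index = build_inverted_index(values)
print(keyword_search("database keys", index))   # {1}
```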

Potential future work

it looks like we do have access to such a very large text database.

also, the database input and output are both text sequences, which may not be directly useful for language-assisted RL.