[TOC]
- Title: Improving language models by retrieving from trillions of tokens
- Author: Sebastian Borgeaud et al.
- Publish Year: Feb 2022
- Review Date: Mar 2022
Summary of paper
Motivation
in order to decrease the size of language models, this work suggests retrieval from a large text database as a complementary path to scaling up language models.
they equip models with the ability to directly access a large dataset when making predictions, a semi-parametric approach.
How do they do it
first, construct a key-value database, where values store raw chunks of text tokens and keys are frozen BERT embeddings.
- then a frozen model (not trainable) is used to avoid having to periodically re-compute embeddings over the entire database during training (see the sketch below).
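A minimal sketch of this key/value construction, assuming the Hugging Face transformers library and bert-base-uncased as the frozen encoder (the model name, helper names, and toy data below are illustrative, not the authors' code):

```python
# Minimal sketch: build the key-value retrieval database.
# Values: raw text chunks. Keys: time-averaged frozen BERT embeddings.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")
bert.eval()  # frozen: never trained, so keys never need re-computation

@torch.no_grad()
def embed_chunks(text_chunks):
    """Embed each text chunk with the frozen BERT, averaged over tokens."""
    batch = tokenizer(text_chunks, padding=True, truncation=True,
                      max_length=64, return_tensors="pt")
    hidden = bert(**batch).last_hidden_state          # (B, T, 768)
    mask = batch["attention_mask"].unsqueeze(-1)      # ignore padding
    return (hidden * mask).sum(1) / mask.sum(1)       # (B, 768)

# Toy database: values are the raw chunks, keys their frozen embeddings.
values = ["The Eiffel Tower is in Paris.", "Retrieval augments language models."]
keys = embed_chunks(values)                           # (2, 768) tensor
database = {"keys": keys, "values": values}
```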
second, each training input sequence is split into chunks, and each chunk is augmented with its k-nearest neighbours retrieved from the database.
- e.g., a chunk C1 will have several value neighbours retrieved from the database (see the sketch after this list).
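A rough sketch of the chunking-and-retrieval step, reusing the illustrative `embed_chunks` and `database` from the previous sketch; brute-force inner-product search stands in here for the approximate nearest-neighbour index needed at the paper's scale:

```python
# Sketch: split a sequence into chunks and retrieve neighbours per chunk.
import torch

CHUNK_TOKENS = 64  # chunk length used in the paper
K = 2              # number of neighbours per chunk

def split_into_chunks(words, chunk_len=CHUNK_TOKENS):
    """Split a (whitespace-)tokenised sequence into fixed-size chunks."""
    return [" ".join(words[i:i + chunk_len])
            for i in range(0, len(words), chunk_len)]

@torch.no_grad()
def retrieve_neighbours(chunk_text, k=K):
    """Return the k database values whose keys are closest to the chunk."""
    query = embed_chunks([chunk_text])                 # (1, 768)
    scores = query @ database["keys"].T                # inner-product search
    top = scores.topk(k, dim=-1).indices[0]
    return [database["values"][i] for i in top.tolist()]

# Each chunk Ci of the training sequence gets its own retrieved neighbours.
sequence = "The Eiffel Tower was completed in 1889 and stands in Paris".split()
for chunk in split_into_chunks(sequence):
    print(chunk, "->", retrieve_neighbours(chunk))
```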
finally, an encoder-decoder architecture integrates the retrieved chunks into the model's predictions.
- check the chunked cross-attention (CCA) architecture; it preserves autoregressivity, i.e., each token depends only on previous tokens (a simplified block is sketched after this list).
- FFW is the feed-forward (fully connected) layer
- ATTN is the self-attention module
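A heavily simplified PyTorch sketch of how one decoder block interleaves ATTN, CCA and FFW. All names and dimensions are illustrative, and the masking is a simplification: the paper's CCA uses a slightly shifted attention pattern, whereas here tokens of chunk i simply attend to the neighbours retrieved for chunk i-1, so no token sees information retrieved from future tokens:

```python
# Simplified sketch of one RETRO-style decoder block: causal self-attention
# (ATTN), chunked cross-attention to retrieved neighbours (CCA), then a
# feed-forward layer (FFW).
import torch
import torch.nn as nn

class RetroBlockSketch(nn.Module):
    def __init__(self, d_model=256, n_heads=4, chunk_len=64):
        super().__init__()
        self.chunk_len = chunk_len
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cca = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffw = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, x, retrieved):
        # x: (B, n_chunks * chunk_len, D) token hidden states
        # retrieved: (B, n_chunks, neigh_len, D) encoded neighbours per chunk
        B, L, D = x.shape
        causal = torch.triu(torch.ones(L, L, dtype=torch.bool,
                                       device=x.device), diagonal=1)
        x = x + self.attn(x, x, x, attn_mask=causal)[0]          # ATTN

        n_chunks = L // self.chunk_len
        out = x.clone()
        for i in range(1, n_chunks):                             # CCA
            # Tokens of chunk i attend to neighbours retrieved for chunk i-1,
            # which were computed from past tokens only (autoregressivity).
            q = x[:, i * self.chunk_len:(i + 1) * self.chunk_len]
            kv = retrieved[:, i - 1]
            out[:, i * self.chunk_len:(i + 1) * self.chunk_len] = \
                q + self.cca(q, kv, kv)[0]
        return out + self.ffw(out)                               # FFW

# Toy usage: 2 chunks of 4 tokens, 6 neighbour tokens per chunk.
block = RetroBlockSketch(d_model=256, n_heads=4, chunk_len=4)
x = torch.randn(2, 8, 256)
retrieved = torch.randn(2, 2, 6, 256)
y = block(x, retrieved)                  # (2, 8, 256)
```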
Algorithm
Some key terms
retrieval database
a key-value database
- where values store raw chunks of text tokens and keys are frozen BERT embeddings
query keys
- two main approaches are matching words in the query against the database index (keyword searching) and traversing the database using hypertext or hypermedia links
- in this work, the values in the database are informative text sentences (see the toy sketch below)
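As a toy illustration of such a key-value retrieval database, here is a sketch using random vectors and the faiss library; the paper reports using the SCaNN library for approximate nearest-neighbour search over the full database, so faiss and all names below are stand-ins for illustration only:

```python
# Toy key-value retrieval database backed by an off-the-shelf vector index.
import numpy as np
import faiss  # pip install faiss-cpu

d = 768                                             # BERT embedding dimension
keys = np.random.rand(1000, d).astype("float32")    # frozen chunk embeddings
values = [f"text chunk {i}" for i in range(1000)]   # raw token chunks

index = faiss.IndexFlatIP(d)                        # exact inner-product search
index.add(keys)

query = np.random.rand(1, d).astype("float32")      # embedding of a new chunk
scores, ids = index.search(query, 5)                # 5 nearest neighbours
neighbours = [values[i] for i in ids[0]]            # retrieved values
```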
Potential future work
it looks like we do have access to such a very large text database
also, the retrieval inputs and outputs are both text sequences, which may not be directly useful for language-assisted RL