[TOC]

  1. Title: Grounding Language Models to Images for Multimodal Generation
  2. Author: Jing Yu Koh et al.
  3. Publish Date: 31 Jan 2023
  4. Review Date: Mon, Feb 6, 2023
  5. url: https://arxiv.org/pdf/2301.13823.pdf

Summary of paper


Motivation

  • we propose an efficient method to ground pre-trained text-only language models to the visual domain
  • How
    • we keep the language model frozen, and finetune input and output linear layers to enable cross-modality interactions. This allows our model to process arbitrarily interleaved image-and-text inputs and generate free-form text interleaved with retrieved images (see the sketch after this list)
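
A minimal PyTorch-style sketch of this "frozen backbones, trainable linear layers" idea. The module names and dimensions below are placeholders I chose for illustration, not the paper's code; the point is that only the small translation layers receive gradients.

```python
import torch.nn as nn

def freeze(module: nn.Module) -> nn.Module:
    """Turn off gradients so the pre-trained weights stay fixed."""
    for p in module.parameters():
        p.requires_grad = False
    return module

# stand-ins for the pre-trained language model and visual encoder
lm = freeze(nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=1024, nhead=8, batch_first=True),
    num_layers=2))
visual_encoder = freeze(nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 512)))

# the only trainable pieces: the input/output linear translation layers
img_to_text = nn.Linear(512, 1024)   # image features -> LM input embedding space
text_to_ret = nn.Linear(1024, 256)   # LM hidden state -> visual retrieval space

trainable = sum(p.numel() for m in (img_to_text, text_to_ret) for p in m.parameters())
frozen = sum(p.numel() for m in (lm, visual_encoder) for p in m.parameters())
print(f"trainable: {trainable:,} vs frozen: {frozen:,}")  # tiny fraction is trained
```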

Contribution

  • our approach works with any off-the-shelf language model and paves the way towards an effective, general solution for leveraging pre-trained language models in visually grounded settings.

LLMs for vision-and-language

  • we differ from previous work in that our model is capable of generating coherent multimodal output: Flamingo, for example, is incapable of producing visual output.

efficient adaptation of pretrained models

  • our work builds upon the insights and methods from these prior works. While previous models mostly focus on generating text-only outputs, our model is capable of processing arbitrarily interleaved image-text inputs to generate coherent interleaved image-text outputs.

Method

  1. We learn translation parameters (parameterized as linear layers) to cast image features into the LM's text embedding space,
  2. and to translate LM hidden states (at a dedicated retrieval token) into a visual retrieval space. These are two separate linear mappings trained for different objectives, not a cycle-consistency constraint (see the sketch below).
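
To make the two directions concrete, here is a rough sketch with made-up sizes and module names (not the paper's code): one map feeds images into the frozen LM's input space, the other maps an LM hidden state into a retrieval space shared with projected image features.

```python
import torch
import torch.nn as nn

vis_dim, lm_dim, ret_dim = 512, 1024, 256   # illustrative sizes

# (1) input side: cast visual-encoder features into the LM's input
#     embedding space so images can sit inside the token sequence
img_to_text_space = nn.Linear(vis_dim, lm_dim)

# (2) output side: cast the LM hidden state of a dedicated retrieval token
#     (and, symmetrically, the image features) into a shared retrieval space
text_to_ret_space = nn.Linear(lm_dim, ret_dim)
img_to_ret_space = nn.Linear(vis_dim, ret_dim)

image_feats = torch.randn(4, vis_dim)   # stand-in for frozen visual encoder output
ret_hidden = torch.randn(4, lm_dim)     # stand-in for the LM state at the retrieval token

visual_tokens = img_to_text_space(image_feats)   # (4, lm_dim), spliced into the LM input
text_query = text_to_ret_space(ret_hidden)       # (4, ret_dim), used to score images
image_keys = img_to_ret_space(image_feats)       # (4, ret_dim)
```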

two training objectives

  1. image captioning
  2. image-text retrieval
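
A hedged sketch of what these two objectives typically look like: next-token cross-entropy for captioning conditioned on the projected image embeddings, and a symmetric InfoNCE-style contrastive loss for retrieval. Tensor names, shapes, and the temperature value are assumptions for illustration, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def captioning_loss(lm_logits, caption_ids):
    """Standard next-token cross-entropy over the caption tokens,
    conditioned on visual tokens prepended to the LM input sequence."""
    # lm_logits: (B, T, vocab), caption_ids: (B, T)
    return F.cross_entropy(
        lm_logits[:, :-1].reshape(-1, lm_logits.size(-1)),
        caption_ids[:, 1:].reshape(-1),
    )

def retrieval_loss(text_query, image_keys, temperature=0.07):
    """Symmetric InfoNCE: matched image-text pairs on the diagonal are
    positives, every other pairing in the batch is a negative."""
    text = F.normalize(text_query, dim=-1)    # (B, d): projected retrieval-token states
    image = F.normalize(image_keys, dim=-1)   # (B, d): projected image features
    logits = text @ image.t() / temperature   # (B, B) similarity matrix
    labels = torch.arange(len(logits), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))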

How does it output images

  • images are produced by retrieval rather than pixel generation: the model learns a special [RET] token, and its output embedding is used to select images from a candidate set; because this embedding is conditioned on the full dialogue context, the model can resolve coreferences to pick the appropriate images (sketch below)
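
A minimal sketch of that retrieval step, assuming a precomputed set of candidate image embeddings; the function name and sizes are illustrative, not the authors' API.

```python
import torch
import torch.nn.functional as F

def retrieve_images(ret_embedding, candidate_embeddings, k=3):
    """Return indices of the k candidate images most similar to the
    projected [RET] embedding produced from the dialogue context."""
    q = F.normalize(ret_embedding, dim=-1)            # (d,)
    keys = F.normalize(candidate_embeddings, dim=-1)  # (N, d)
    scores = keys @ q                                  # (N,) cosine similarities
    return scores.topk(k).indices

# usage: because the [RET] embedding is computed from the full interleaved
# dialogue, references like "show me another one of those" are resolved
# against earlier context before the candidates are scored
ret_embedding = torch.randn(256)
candidates = torch.randn(10_000, 256)   # precomputed candidate image embeddings
print(retrieve_images(ret_embedding, candidates, k=3))
```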

Good things about the paper (one paragraph)

  • the framework serves a similar function to CLIP for image-text retrieval, but it utilises frozen, pretrained large-scale language models, which also gives it free-form text generation and in-context abilities over interleaved image-text inputs.