[TOC]

  1. Title: Grounding Language Models to Images for Multimodal Generation
  2. Author: Jing Yu Koh et al.
  3. Publish Year: 31 Jan 2023
  4. Review Date: Mon, Feb 6, 2023
  5. url: https://arxiv.org/pdf/2301.13823.pdf

Summary of paper


Motivation

Contribution

LLMs for vision-and-language

efficient adaptation of pretrained models

Method

  1. We learn translation parameters (parameterized as linear layers) that cast visual features into the LLM's text embedding space,
  2. and translate LLM hidden states (of a learned [RET] token) into the visual embedding space. This is not a cycle: the first mapping grounds image inputs for captioning, while the second produces embeddings used to retrieve output images (see the sketch below).
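A minimal sketch of what these translation layers might look like, assuming a frozen visual encoder producing `visual_dim`-dimensional features and a frozen LLM with `text_dim`-dimensional token embeddings; all names and dimensions are illustrative, not taken from the paper's code:

```python
import torch
import torch.nn as nn

class TranslationLayers(nn.Module):
    """Two linear translation layers bridging a frozen visual encoder and a
    frozen LLM (illustrative sketch, not the official implementation)."""

    def __init__(self, visual_dim: int = 1024, text_dim: int = 4096,
                 num_prefix_tokens: int = 4):
        super().__init__()
        self.num_prefix_tokens = num_prefix_tokens
        # image features -> a short sequence of "visual prefix" token embeddings
        self.img_to_text = nn.Linear(visual_dim, text_dim * num_prefix_tokens)
        # LLM hidden state (e.g. of the learned [RET] token) -> visual embedding space
        self.text_to_img = nn.Linear(text_dim, visual_dim)

    def visual_prefix(self, image_features: torch.Tensor) -> torch.Tensor:
        """(B, visual_dim) -> (B, num_prefix_tokens, text_dim), fed to the LLM as input embeddings."""
        batch = image_features.size(0)
        return self.img_to_text(image_features).view(batch, self.num_prefix_tokens, -1)

    def retrieval_embedding(self, ret_hidden_state: torch.Tensor) -> torch.Tensor:
        """(B, text_dim) -> (B, visual_dim), compared against image features for retrieval."""
        return self.text_to_img(ret_hidden_state)
```

Only these translation layers (plus the [RET] token embedding) are trained; the LLM and the visual encoder stay frozen, which is what makes the adaptation efficient.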

two training objectives (a rough loss sketch for both follows this list)

  1. image captioning: generate the caption autoregressively with the frozen LLM, conditioned on the translated visual prefix
  2. image-text retrieval: a contrastive loss that pulls the translated [RET] hidden state towards the embedding of the matching image within the batch
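A hedged sketch of the two losses, assuming the translation layers above and a HuggingFace-style causal LM that accepts `inputs_embeds`; helper names and the temperature value are illustrative assumptions, not the paper's exact code:

```python
import torch
import torch.nn.functional as F

def captioning_loss(llm, visual_prefix, caption_ids):
    """Image captioning: condition the frozen LLM on the visual prefix and
    minimise cross-entropy of the caption tokens (sketch; `llm` is assumed to
    accept `inputs_embeds` like a HuggingFace causal LM)."""
    token_embeds = llm.get_input_embeddings()(caption_ids)        # (B, T, d)
    inputs = torch.cat([visual_prefix, token_embeds], dim=1)      # prepend image prefix
    logits = llm(inputs_embeds=inputs).logits                     # (B, k + T, vocab)
    k = visual_prefix.size(1)
    pred = logits[:, k - 1:-1, :]                                 # positions predicting caption tokens
    return F.cross_entropy(pred.reshape(-1, pred.size(-1)), caption_ids.reshape(-1))

def retrieval_loss(text_emb, image_emb, temperature=0.07):
    """Image-text retrieval: symmetric InfoNCE over the in-batch similarity
    matrix between translated [RET] embeddings and visual embeddings."""
    text_emb = F.normalize(text_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)
    logits = text_emb @ image_emb.t() / temperature               # (B, B)
    targets = torch.arange(logits.size(0), device=logits.device)  # matching pairs on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```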

How does it output images

The model does not generate pixels. It learns to emit a special [RET] token; the [RET] hidden state is translated into the visual embedding space and used to retrieve the most similar image from a candidate set, which is returned as the model's image output.

Good things about the paper (one paragraph)