[TOC]

  1. Title: Grounding Language Models to Images for Multimodal Generation
  2. Author: Jing Yu Koh et al.
  3. Publish Date: 31 Jan 2023
  4. Review Date: Mon, Feb 6, 2023
  5. url: https://arxiv.org/pdf/2301.13823.pdf

Summary of paper


Motivation

  • we propose an efficient method to ground pre-trained text-only language models to the visual domain
  • How
    • we keep the language model frozen, and finetune input and output linear layers to enable cross-modality interactions. This allows our model to process arbitrarily interleaved image-and-text inputs and generate free-form text interleaved with retrieved images (see the sketch after this list)
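
A minimal PyTorch-style sketch of this "frozen backbones, trainable linear layers" idea. The module names and dimensions below are placeholders I chose for illustration, not the paper's code; the point is that only the small translation layers receive gradients.

```python
import torch.nn as nn

def freeze(module: nn.Module) -> nn.Module:
    """Turn off gradients so the pre-trained weights stay fixed."""
    for p in module.parameters():
        p.requires_grad = False
    return module

# stand-ins for the pre-trained language model and visual encoder
lm = freeze(nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=1024, nhead=8, batch_first=True),
    num_layers=2))
visual_encoder = freeze(nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 512)))

# the only trainable pieces: the input/output linear translation layers
img_to_text = nn.Linear(512, 1024)   # image features -> LM input embedding space
text_to_ret = nn.Linear(1024, 256)   # LM hidden state -> visual retrieval space

trainable = sum(p.numel() for m in (img_to_text, text_to_ret) for p in m.parameters())
frozen = sum(p.numel() for m in (lm, visual_encoder) for p in m.parameters())
print(f"trainable: {trainable:,} vs frozen: {frozen:,}")  # tiny fraction is trained
```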

Contribution

  • our approach works with any off-the-shelf language model and paves the way towards an effective, general solution for leveraging pre-trained language models in visually grounded settings.

LLMs for vision-and-language

  • we differ from previous work in that our model is capable of generating coherent multimodal output: Flamingo, for example, is incapable of producing visual output.

efficient adaptation of pretrained models

  • our work builds upon the insights and methods from these prior works. While previous models mostly focus on generating text-only outputs, our model is capable of processing arbitrarily interleaved image-text inputs to generate coherent interleaved image-text outputs.

Method

  1. We learn translation parameters (parameterized as linear layers) to cast image features into the LM's text embedding space,
  2. and to translate LM hidden states (at a dedicated retrieval token) into a visual retrieval space. These are two separate linear mappings trained for different objectives, not a cycle-consistency constraint (see the sketch below).
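
To make the two directions concrete, here is a rough sketch with made-up sizes and module names (not the paper's code): one map feeds images into the frozen LM's input space, the other maps an LM hidden state into a retrieval space shared with projected image features.

```python
import torch
import torch.nn as nn

vis_dim, lm_dim, ret_dim = 512, 1024, 256   # illustrative sizes

# (1) input side: cast visual-encoder features into the LM's input
#     embedding space so images can sit inside the token sequence
img_to_text_space = nn.Linear(vis_dim, lm_dim)

# (2) output side: cast the LM hidden state of a dedicated retrieval token
#     (and, symmetrically, the image features) into a shared retrieval space
text_to_ret_space = nn.Linear(lm_dim, ret_dim)
img_to_ret_space = nn.Linear(vis_dim, ret_dim)

image_feats = torch.randn(4, vis_dim)   # stand-in for frozen visual encoder output
ret_hidden = torch.randn(4, lm_dim)     # stand-in for the LM state at the retrieval token

visual_tokens = img_to_text_space(image_feats)   # (4, lm_dim), spliced into the LM input
text_query = text_to_ret_space(ret_hidden)       # (4, ret_dim), used to score images
image_keys = img_to_ret_space(image_feats)       # (4, ret_dim)
```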

two training objectives

  1. image captioning
  2. image-text retrieval
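
A hedged sketch of what these two objectives typically look like: next-token cross-entropy for captioning conditioned on the projected image embeddings, and a symmetric InfoNCE-style contrastive loss for retrieval. Tensor names, shapes, and the temperature value are assumptions for illustration, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def captioning_loss(lm_logits, caption_ids):
    """Standard next-token cross-entropy over the caption tokens,
    conditioned on visual tokens prepended to the LM input sequence."""
    # lm_logits: (B, T, vocab), caption_ids: (B, T)
    return F.cross_entropy(
        lm_logits[:, :-1].reshape(-1, lm_logits.size(-1)),
        caption_ids[:, 1:].reshape(-1),
    )

def retrieval_loss(text_query, image_keys, temperature=0.07):
    """Symmetric InfoNCE: matched image-text pairs on the diagonal are
    positives, every other pairing in the batch is a negative."""
    text = F.normalize(text_query, dim=-1)    # (B, d): projected retrieval-token states
    image = F.normalize(image_keys, dim=-1)   # (B, d): projected image features
    logits = text @ image.t() / temperature   # (B, B) similarity matrix
    labels = torch.arange(len(logits), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))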

How does it output images

  • images are produced by retrieval rather than pixel generation: the model learns a special [RET] token, and its output embedding is used to select images from a candidate set; because this embedding is conditioned on the full dialogue context, the model can resolve coreferences to pick the appropriate images (sketch below)
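
A minimal sketch of that retrieval step, assuming a precomputed set of candidate image embeddings; the function name and sizes are illustrative, not the authors' API.

```python
import torch
import torch.nn.functional as F

def retrieve_images(ret_embedding, candidate_embeddings, k=3):
    """Return indices of the k candidate images most similar to the
    projected [RET] embedding produced from the dialogue context."""
    q = F.normalize(ret_embedding, dim=-1)            # (d,)
    keys = F.normalize(candidate_embeddings, dim=-1)  # (N, d)
    scores = keys @ q                                  # (N,) cosine similarities
    return scores.topk(k).indices

# usage: because the [RET] embedding is computed from the full interleaved
# dialogue, references like "show me another one of those" are resolved
# against earlier context before the candidates are scored
ret_embedding = torch.randn(256)
candidates = torch.randn(10_000, 256)   # precomputed candidate image embeddings
print(retrieve_images(ret_embedding, candidates, k=3))
```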

Good things about the paper (one paragraph)

  • the framework serves a similar function to CLIP for image-text retrieval, but it utilises frozen, pretrained large-scale language models, which also gives it free-form text generation and in-context abilities over interleaved image-text inputs.