[TOC]

  1. Title: BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models, 2023
  2. Author: Junnan Li et al.
  3. Publish Date: 15 Jun 2023
  4. Review Date: Mon, Aug 28, 2023
  5. url: https://arxiv.org/pdf/2301.12597.pdf

Summary of paper

The paper proposes BLIP-2, a new and efficient pre-training strategy for vision-language models. End-to-end pre-training of such models has become prohibitively expensive as they scale; BLIP-2 sidesteps this cost by leveraging off-the-shelf, pre-trained image encoders and large language models (LLMs), both of which are kept frozen throughout pre-training.

Key Components:

  1. Querying Transformer (Q-Former): A lightweight transformer that uses learnable query vectors to extract features from the frozen image encoder. It acts as an information bottleneck between the frozen image encoder and the frozen LLM.
  2. Two-Stage Pre-training Strategy (see the setup sketch after this list):
    • First Stage: Vision-language representation learning, where Q-Former learns to extract visual features most relevant to the text.
    • Second Stage: Vision-to-language generative learning, where Q-Former is trained to produce visual representations that can be interpreted by the LLM.
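
As a rough illustration of what “frozen” means in practice, here is a minimal PyTorch-style sketch of the training setup: only the Q-Former (plus, in the second stage, a small projection layer into the LLM’s embedding space) receives gradient updates. The stand-in modules and the 768/4096 hidden sizes are illustrative assumptions, not the paper’s actual code.

```python
import torch
import torch.nn as nn

# Stand-in modules; the real models are a ViT-g image encoder, a BERT-based
# Q-Former, and an OPT or FlanT5 LLM.
image_encoder = nn.Linear(3, 1408)  # stand-in for the frozen image encoder
q_former = nn.Linear(1408, 768)     # stand-in for the trainable Q-Former
llm = nn.Linear(4096, 4096)         # stand-in for the frozen LLM

def freeze(module: nn.Module) -> None:
    """No gradients, and eval mode so dropout / norm statistics stay fixed."""
    for p in module.parameters():
        p.requires_grad = False
    module.eval()

freeze(image_encoder)  # frozen in both stages
freeze(llm)            # frozen in the second (generative) stage

# Second stage: a linear layer projects the Q-Former's output queries into
# the LLM's embedding space, where they act as soft visual prompts prepended
# to the text embeddings. 768 -> 4096 is an assumed pair of hidden sizes.
proj = nn.Linear(768, 4096)

# Only Q-Former and projection parameters ever reach the optimizer.
trainable = list(q_former.parameters()) + list(proj.parameters())
optimizer = torch.optim.AdamW(trainable, lr=1e-4)
```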

Advantages:

The paper claims that BLIP-2 outperforms existing models like Flamingo80B by 8.7% on zero-shot VQAv2 while having 54x fewer trainable parameters.

Q-Former

Information Bottleneck: The term “information bottleneck” refers to Q-Former’s role in selectively passing the most useful visual features from the frozen image encoder to the frozen LLM. It filters and condenses the information, ensuring that only the most relevant visual features are used for generating the desired text output.
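
A back-of-the-envelope calculation (using the configuration reported in the paper: a ViT-g encoder at 224×224 resolution and 32 queries of width 768) shows how much the bottleneck compresses the visual interface to the LLM:

```python
vit_tokens, vit_dim = 257, 1408  # ViT-g/14 at 224x224: 16*16 patches + CLS
num_queries, q_dim = 32, 768     # Q-Former output queries

before = vit_tokens * vit_dim    # 361,856 values per image
after = num_queries * q_dim      # 24,576 values per image
print(f"compression: {before / after:.1f}x")  # ~14.7x
```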

Learnable Query Embeddings: Q-Former takes a fixed set of learnable query vectors (embeddings) as input to its image transformer. These queries interact with each other through self-attention layers and with the frozen image features through cross-attention layers. Depending on the pre-training objective, they can also interact with text tokens through the same (shared) self-attention layers, with task-specific attention masks controlling which interactions are allowed.
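
To make the mechanics concrete, below is a minimal, self-contained sketch of a single Q-Former block: learnable queries go through self-attention, then cross-attend into the frozen image features. The 32/768/1408 dimensions follow the paper’s reported configuration, but the block itself is a simplification; the real Q-Former is a multi-layer BERT-style transformer whose self-attention is shared with a text stream, which this sketch omits.

```python
import torch
import torch.nn as nn

class QFormerBlockSketch(nn.Module):
    """One simplified Q-Former block: self-attention over the queries,
    then cross-attention into frozen image features, then a feed-forward
    network. Residuals included; layer norms omitted for brevity."""

    def __init__(self, num_queries=32, dim=768, img_dim=1408, num_heads=12):
        super().__init__()
        # Learnable query embeddings: the only input on the query side.
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim) * 0.02)
        # Queries interact with each other through self-attention ...
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # ... and with the frozen image features through cross-attention.
        self.cross_attn = nn.MultiheadAttention(
            dim, num_heads, kdim=img_dim, vdim=img_dim, batch_first=True
        )
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, image_feats):
        # image_feats: (batch, num_patches, img_dim) from the frozen encoder
        q = self.queries.expand(image_feats.size(0), -1, -1)
        q = q + self.self_attn(q, q, q, need_weights=False)[0]
        q = q + self.cross_attn(q, image_feats, image_feats,
                                need_weights=False)[0]
        return q + self.ffn(q)  # (batch, num_queries, dim): the bottleneck

# 257 frozen ViT tokens in, 32 query outputs out.
image_feats = torch.randn(2, 257, 1408)  # stand-in for frozen ViT-g features
print(QFormerBlockSketch()(image_feats).shape)  # torch.Size([2, 32, 768])
```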
