[TOC]

  1. Title: Multi-modal Alignment Using Representation Codebook
  2. Author: Jiali Duan, Liqun Chen et al.
  3. Publish Year: 2022 CVPR
  4. Review Date: Tue, Aug 9, 2022

Summary of paper

Motivation

Contribution

Some key terms

Types of vision-language pre-training tasks

  1. multimodal alignment: aligning the feature spaces of different modalities
    • late-fusion approaches such as CLIP and ALIGN focus on this
  2. cross-modal fusion: capturing the interaction across modalities
    • early-fusion approaches such as OSCAR, VinVL and ViLT focus on this

Momentum distillation

  1. for each of the image, text and fusion encoders, there is a corresponding momentum encoder that is updated via a moving average, without gradient back-propagation. These momentum encoders serve as teachers that guide the self-supervised learning process. In this paper, the teachers also guide codebook learning as well as the cross-modal and intra-modal alignment.
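The teacher update above can be sketched as an exponential moving average (EMA) over parameters. This is my own minimal NumPy sketch; the coefficient 0.995 is an assumption, not the paper's reported value.

```python
import numpy as np

# Sketch of the momentum (teacher) encoder update used for each encoder.
# The EMA coefficient 0.995 is an assumption; the paper's value may differ.
MOMENTUM = 0.995

def momentum_update(student_params, teacher_params, m=MOMENTUM):
    """Return updated teacher params: m * teacher + (1 - m) * student.

    The teacher receives no gradients; it is refreshed only by this
    moving average after each training step.
    """
    return [m * t + (1.0 - m) * s for s, t in zip(student_params, teacher_params)]
```

With a large `m`, the teacher changes slowly, which is what makes it a stable target for distillation.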

codebook

The codebook is a d-by-K matrix used as a shared projector that maps image and text features into a common space; its K columns act as codewords (cluster centres).
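A minimal sketch of how a d-by-K codebook can quantize a feature into a codeword; the dimensions, cosine-similarity assignment, and all names here are my own illustration, not the paper's exact formulation.

```python
import numpy as np

# Hypothetical sketch: each column of C is a codeword (cluster centre);
# a feature is assigned to its nearest codeword by cosine similarity.
rng = np.random.default_rng(0)
d, K = 256, 1024          # feature dim and number of codewords (illustrative)
C = rng.normal(size=(d, K))
C /= np.linalg.norm(C, axis=0, keepdims=True)   # unit-norm codewords

def quantize(z: np.ndarray):
    """Return the index of the nearest codeword and the codeword itself."""
    z = z / np.linalg.norm(z)
    sims = z @ C              # cosine similarity with every codeword
    k = int(np.argmax(sims))
    return k, C[:, k]
```

Because both modalities are quantized against the same codebook, the codewords act as a shared vocabulary for image and text features.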

Method


  1. in this work, features from the image and text modalities are first aligned and then fused using a transformer encoder.
  2. the main focus of the work is the feature-alignment stage -> make it more efficient
  3. the main contribution of this method: a codebook that quantizes the common text-image feature space into codewords (cluster centres).
    1. these cluster centres provide a more stable target for contrastive reasoning than individual text or visual features.

Inspiration from SwAV

Two augmented versions (views) of the same input image are passed through a deep network for feature extraction. The visual embedding is learned by optimising an objective that enforces consistency between the features of one view and the cluster assigned to the other view: different views of the same input should map to the same entity, and that entity is represented as a cluster centre (a codeword, in this paper's terms).
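The swapped-prediction objective above can be sketched as follows. This is my own simplified NumPy version (function names, temperature, and the uniform assignments in the usage are all assumptions): each view's features must predict the *other* view's cluster assignment.

```python
import numpy as np

# Sketch of a SwAV-style swapped-prediction loss (my simplification):
# view 1's features predict view 2's cluster assignment, and vice versa.
def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def swapped_loss(z1, z2, q1, q2, C, temp=0.1):
    """z1, z2: L2-normalised features (B, d) for the two views.
    q1, q2: their soft cluster assignments (B, K), e.g. from Sinkhorn.
    C: codebook (d, K). Returns the swapped cross-entropy."""
    p1 = softmax(z1 @ C / temp)   # view 1's predicted codeword distribution
    p2 = softmax(z2 @ C / temp)
    # each view predicts the *other* view's assignment
    ce = -(q2 * np.log(p1 + 1e-9)).sum(1) - (q1 * np.log(p2 + 1e-9)).sum(1)
    return float(ce.mean())
```

In the multi-modal setting of this paper, the two "views" can be the image and text features of the same pair rather than two image augmentations.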

Overview of the framework


Optimal Transport

http://alexhwilliams.info/itsneuronalblog/2020/10/09/optimal-transport/

this encourages each feature vector to be similar to one of the codeword cluster centres


Optimal transport is somewhat complex to implement, so we may want to use an alternative in our own work.

Essentially, the idea is to learn intermediate cluster-centre vectors so that both image and text features can be projected onto them.
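For reference, the entropy-regularised optimal-transport assignment of features to codewords is commonly solved with a few Sinkhorn-Knopp iterations, as in SwAV. The sketch below is my own; the epsilon and iteration count are assumptions, not values from the paper.

```python
import numpy as np

# Sketch of the Sinkhorn-Knopp iteration used to assign features to
# codewords under entropy-regularised optimal transport (as in SwAV).
def sinkhorn(scores: np.ndarray, eps: float = 0.05, n_iters: int = 3) -> np.ndarray:
    """scores: (B, K) similarities between B features and K codewords.
    Returns a soft assignment Q whose rows sum to 1 and whose columns
    are approximately balanced, preventing collapse onto one codeword."""
    Q = np.exp(scores / eps).T          # (K, B)
    Q /= Q.sum()
    K, B = Q.shape
    for _ in range(n_iters):
        Q /= Q.sum(axis=1, keepdims=True); Q /= K   # normalise rows
        Q /= Q.sum(axis=0, keepdims=True); Q /= B   # normalise columns
    return (Q * B).T                    # each row is one sample's assignment
```

The column-balancing step is what distinguishes this from a plain softmax: it spreads assignments across codewords instead of letting all features collapse onto the most popular one.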

Potential future work

Use the alignment loss from this paper to train our model.

Note, however, that this model does not consider temporal order / sequence alignment.