[TOC]
- Title: Multi-modal Alignment Using Representation Codebook
- Author: Jiali Duan, Liqun Chen et al.
- Publish Year: 2022 CVPR
- Review Date: Tue, Aug 9, 2022
Summary of paper
Motivation
- aligning signals from different modalities is an important step, as it affects the performance of later stages such as cross-modality fusion.
- since image and text often reside in different regions of the feature space, directly aligning them at the instance level is challenging, especially when features are still evolving during training.
Contribution
- in this paper, we treat image and text as two “views” of the same entity, and encode them into a joint vision-language coding space spanned by a dictionary of cluster centres (codebook).
- to further smooth out the learning process, we adopt a teacher-student distillation paradigm, where the momentum teacher of one view guides the student learning of the other.
Some key terms
Types of vision-language pre-training tasks
- multimodal alignment: aligning the feature spaces of different modalities
  - late-fusion approaches such as CLIP and ALIGN focus on this
- cross-modal fusion: capturing the interaction across modalities
  - early-fusion approaches such as OSCAR, VinVL and ViLT focus on this
**momentum distillation**
- for each of the image, text and fusion encoders, there is a corresponding momentum encoder that is updated through a moving average without gradient back-propagation. These momentum encoders serve as teachers to guide the self-supervised learning process. In this paper, the teachers are used to guide codebook learning as well as the cross-modal and intra-modal alignment (a minimal sketch of the momentum update follows).
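A minimal sketch of such a momentum (EMA) update, assuming PyTorch; the function name `ema_update` and the coefficient 0.995 are illustrative assumptions, not values taken from the paper.

```python
import torch

@torch.no_grad()
def ema_update(student: torch.nn.Module, teacher: torch.nn.Module, m: float = 0.995):
    # Teacher parameters are an exponential moving average of the student's;
    # no gradients are back-propagated into the teacher.
    for p_s, p_t in zip(student.parameters(), teacher.parameters()):
        p_t.data.mul_(m).add_(p_s.data, alpha=1.0 - m)
```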
**codebook**
- the codebook is a d-by-K matrix used as a projector that maps image and text features into a common space (see the sketch below).
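A minimal sketch of how a d-by-K codebook could project features into a distribution over codewords, assuming PyTorch; the sizes, the temperature of 0.05 and the function name are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

d, K = 256, 1024               # embedding dim and number of codewords (illustrative)
codebook = torch.randn(d, K)   # d-by-K matrix of codeword (cluster-centre) vectors

def codeword_distribution(features: torch.Tensor, temperature: float = 0.05) -> torch.Tensor:
    # Cosine similarity between each (normalised) feature and each codeword,
    # turned into a soft distribution over the K codewords.
    z = F.normalize(features, dim=-1)             # (B, d)
    c = F.normalize(codebook, dim=0)              # (d, K)
    return F.softmax(z @ c / temperature, dim=-1) # (B, K)
```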
Method
- in this work, features from the image and text modalities are first aligned and then fused using a transformer encoder.
- the main focus of the work is the feature alignment stage -> making it more efficient
- the main contribution of this method is a codebook that quantizes the common text-image features into codewords (cluster centres).
- these cluster centres provide a more stable basis for contrastive reasoning than individual text or visual features.
Inspiration from SwAV
Two augmented versions (views) of the same input image are passed through a deep network for feature extraction. The visual embedding is learned by optimising an objective that enforces consistency between the feature from one view and the cluster assignment of the other view (the two views correspond to the same entity, and that entity is represented as a cluster centre, or codeword in this paper).
- Effectively, visual and text features are lined up by aligning each of them to the common codewords during training (see the sketch below).
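A hedged sketch of what a SwAV-style swapped prediction across the two modalities could look like; the exact loss in the paper may differ, and all names and the epsilon value here are illustrative assumptions.

```python
import torch

def swapped_prediction_loss(p_img, q_txt, p_txt, q_img, eps: float = 1e-8):
    # p_* : (B, K) predicted codeword distributions from the student encoders
    # q_* : (B, K) codeword assignments (e.g. from the teachers plus optimal transport)
    # The image prediction is trained to match the text assignment, and vice versa.
    loss_i2t = -(q_txt * torch.log(p_img + eps)).sum(dim=-1).mean()
    loss_t2i = -(q_img * torch.log(p_txt + eps)).sum(dim=-1).mean()
    return 0.5 * (loss_i2t + loss_t2i)
```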
Overview of the framework
Optimal Transport
http://alexhwilliams.info/itsneuronalblog/2020/10/09/optimal-transport/
this encourages each feature vector to be assigned to, and stay close to, one of the codeword cluster centres
Optimal Transport is somewhat complex; we may want to use an alternative way to implement this.
Essentially, the idea is to have intermediate cluster-centre vectors onto which both the image features and the text features can be projected; a common approximation is sketched below.
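One lightweight approximation of this optimal-transport assignment (the one used by SwAV) is a few Sinkhorn-Knopp iterations, which keep codeword usage roughly balanced so features do not collapse onto a few codewords. A sketch, assuming PyTorch and illustrative values for `eps` and the iteration count:

```python
import torch

@torch.no_grad()
def sinkhorn(scores: torch.Tensor, eps: float = 0.05, n_iters: int = 3) -> torch.Tensor:
    # scores: (B, K) feature-to-codeword similarities.
    # Returns soft assignments whose rows sum to 1 and whose columns are
    # roughly balanced across the batch.
    Q = torch.exp(scores / eps).t()          # (K, B)
    Q /= Q.sum()
    K, B = Q.shape
    for _ in range(n_iters):
        Q /= Q.sum(dim=1, keepdim=True)      # normalise each codeword row
        Q /= K
        Q /= Q.sum(dim=0, keepdim=True)      # normalise each sample column
        Q /= B
    return (Q * B).t()                       # (B, K), rows sum to 1
```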
Potential future work
Use the alignment loss in this paper to train our model
Although this model does not consider the temporal order / sequence alignment