[TOC]

  1. Title: VinVL: Revisiting Visual Representations in Vision Language Models
  2. Author: Pengchuan Zhang et al.
  3. Publish Date: 10 Mar 2021
  4. Review Date: Sat, Sep 3, 2022

Summary of paper

Motivation

  • The authors feed the visual features generated by their new object detection model into a Transformer-based VL fusion model, Oscar.

  • They use an improved pretraining approach, OSCAR+, to pre-train the VL model.

Contribution

  1. a larger object detection model, "ResNeXt-152 C4", pre-trained on a larger amount of training data

Some key terms

Vision Language Pretraining

  • it often consists of two stages
    • an object detection model is pre-trained to encode an image and the visual objects in the image to feature vectors, and
    • a cross-modal fusion model is pre-trained to blend text and visual features.
  • this paper focuses on improving the object-centric visual representations and presents a comprehensive empirical study to demonstrate that visual features matter in VL models.

Vision Language models typically consist of two modules

  • Deep learning-based VL models typically consist of two modules:
    • an image understanding module (Vision)
    • and a cross-modal understanding module (VL)
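
A minimal sketch of this two-module decomposition (hypothetical `detector` and `fusion_model` names, not the authors' actual API):

```python
import torch

def vl_forward(image, text_tokens, detector, fusion_model):
    """Vision module: encode the image into region features q = Vision(img).
    VL module: blend text w and visual features q, y = VL(w, q)."""
    with torch.no_grad():
        # The pre-trained object detector is kept frozen and returns one
        # feature vector per detected region (e.g. from the C4 head).
        region_features = detector(image)  # shape: (num_regions, feat_dim)
    # A BERT-style Transformer attends jointly over text tokens and regions.
    return fusion_model(text_tokens, region_features)
```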

Training object detection module

  1. To enhance the visual concepts of tail classes, they perform class-aware sampling to get at least 2000 instances per class (see the sketch after this list).
  2. balance the contribution of each dataset
  3. Unify the object vocabularies
  4. in the end they obtain a vocabulary of 1848 classes
  5. the C4 object detection architecture yields better region features for VL tasks than the FPN architecture
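
A minimal sketch of the class-aware sampling step, assuming `annotations` is a list of (class_id, sample) pairs (illustrative only, not the authors' implementation):

```python
import random
from collections import defaultdict

def class_aware_sample(annotations, min_per_class=2000):
    """Upsample tail classes so every class contributes at least
    `min_per_class` instances to the training set."""
    by_class = defaultdict(list)
    for cls, sample in annotations:
        by_class[cls].append(sample)
    resampled = []
    for cls, samples in by_class.items():
        if len(samples) < min_per_class:
            # Repeat tail-class samples (with replacement) up to the floor.
            samples = samples + random.choices(samples, k=min_per_class - len(samples))
        resampled.extend((cls, s) for s in samples)
    random.shuffle(resampled)
    return resampled
```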

OSCAR+ pretraining method

$$ \mathcal{L}_{\text{Pre-training}} = \mathcal{L}_{\text{MTL}} + \mathcal{L}_{\text{CL3}} $$

  • MTL is the Masked Token Loss: text and tag tokens are masked with 15% probability.
  • CL3 is a 3-way contrastive loss that takes into account two types of training samples x:
    • {caption, image-tags, image-features} triplets of the image captioning and image tagging data, and {question, answer, image-features} triplets of the VQA data
  • the training data contains 50% matched triplets, 25% w-polluted triplets (caption/question replaced), and 25% q-polluted triplets (tags/answer replaced)
  • Result
    • the proposed 3-way contrastive loss transfers well to both tasks (text-image retrieval and VQA)
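
To make the two loss terms concrete, here is a hedged Python sketch of how the pretraining inputs could be constructed; the triplet layout and the `random_w`/`random_q` helpers are illustrative assumptions, not the authors' code:

```python
import random
import torch

def mask_tokens(token_ids, mask_token_id, mask_prob=0.15):
    """MTL inputs: mask text/tag tokens with 15% probability (BERT-style sketch).
    Labels of -100 are ignored by PyTorch's cross-entropy loss."""
    labels = token_ids.clone()
    mask = torch.rand(token_ids.shape) < mask_prob
    labels[~mask] = -100
    masked = token_ids.clone()
    masked[mask] = mask_token_id
    return masked, labels

def make_cl3_sample(triplet, random_w, random_q):
    """CL3 sample: keep the matched triplet with prob. 0.5 (label 0),
    replace w (caption/question) with prob. 0.25 (label 1), or replace
    q (tags/answer) with prob. 0.25 (label 2)."""
    w, q, v = triplet  # (caption/question, tags/answer, image features)
    r = random.random()
    if r < 0.5:
        return (w, q, v), 0           # matched
    elif r < 0.75:
        return (random_w(), q, v), 1  # w-polluted
    else:
        return (w, random_q(), v), 2  # q-polluted
```

The total pretraining loss is then the sum of the cross-entropy over masked positions (MTL) and the 3-class cross-entropy over the contrastive labels (CL3), matching the equation above.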

Good things about the paper (one paragraph)

Github Page: https://github.com/pzzhang/VinVL

Potential future work

transfer the 3-way contrastive loss (CL3) to our own work