[TOC]

  1. Title: VinVL: Revisiting Visual Representations in Vision Language Models
  2. Author: Pengchuan Zhang et al.
  3. Publish Date: 10 Mar 2021
  4. Review Date: Sat, Sep 3, 2022

Summary of paper

Motivation

Contribution

  1. introduces a bigger object detection model, “ResNeXt-152 C4”, trained on a larger amount of detection data

Some key terms

Vision Language Pretraining

Vision Language models typically consist of two modules: an image understanding module that extracts visual features (here, the object detector) and a cross-modal understanding module that fuses vision and language, as sketched below.
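
A minimal sketch of that two-module pipeline (the class names, module interfaces, and tensor shapes here are illustrative assumptions, not the paper's released code):

```python
import torch
import torch.nn as nn

class TwoModuleVLModel(nn.Module):
    """Sketch of the typical VL architecture: a vision module turns an image
    into region features, and a cross-modal module fuses them with text.
    Names and shapes are assumptions for illustration."""

    def __init__(self, vision_module: nn.Module, fusion_module: nn.Module):
        super().__init__()
        self.vision = vision_module  # e.g. an object detector such as ResNeXt-152 C4
        self.fusion = fusion_module  # e.g. a BERT-style transformer (OSCAR+)

    def forward(self, image: torch.Tensor, text_tokens: torch.Tensor):
        region_features = self.vision(image)  # (num_regions, feat_dim)
        return self.fusion(text_tokens, region_features)
```

VinVL's point is that improving the vision module alone, while keeping the fusion module's design fixed, already lifts downstream VL performance.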

Training object detection module

  1. to enhance visual concepts of tail classes, they perform class-aware sampling to get at least 2,000 instances per class (see the sampling sketch after this list)
  2. balance the contribution of each dataset
  3. unify the object vocabularies across the merged datasets
  4. in the end they obtained 1,848 classes
  5. the C4 object detection architecture produces better region features for VL tasks than the FPN architecture
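
A rough sketch of the class-aware sampling step from item 1 (the `(image_id, class_ids)` data layout and the helper name are assumptions, not the paper's implementation):

```python
import random
from collections import Counter

def class_aware_sample(images, min_instances=2000, seed=0):
    """Sketch of class-aware sampling: keep drawing images (with replacement)
    until every observed class has at least `min_instances` annotated boxes.
    `images` is assumed to be a list of (image_id, [class_ids]) pairs."""
    rng = random.Random(seed)
    counts = Counter()
    # Start from one full pass so every image appears at least once.
    sampled = list(images)
    for _, classes in images:
        counts.update(classes)

    # Oversample images that still contain under-represented (tail) classes.
    rare = {c for c, n in counts.items() if n < min_instances}
    pool = [img for img in images if rare & set(img[1])]
    while rare and pool:
        img = rng.choice(pool)
        sampled.append(img)
        counts.update(img[1])
        rare = {c for c, n in counts.items() if n < min_instances}
        pool = [img for img in pool if rare & set(img[1])]
    return sampled
```

This oversamples images containing tail classes until every observed class reaches the floor; the exact procedure used in the paper may differ.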

OSCAR+ pretraining method: the pre-training loss is the sum of a Masked Token Loss (MTL) and a 3-way Contrastive Loss (CL3):

$$ \mathcal{L}_{\text{Pre-training}} = \mathcal{L}_{\text{MTL}} + \mathcal{L}_{\text{CL3}} $$
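
A minimal sketch of how the two loss terms might be combined in training code (the tensor layout and label convention are assumptions for illustration, not the released OSCAR+ API):

```python
import torch.nn.functional as F

def oscar_plus_loss(token_logits, masked_labels, cls_logits, contrast_labels):
    """Sketch of the OSCAR+ objective: L = L_MTL + L_CL3.

    token_logits:    (batch, seq_len, vocab)  predictions for masked tokens
    masked_labels:   (batch, seq_len)         vocab ids, -100 at unmasked positions
    cls_logits:      (batch, 3)               3-way head on the [CLS] token
    contrast_labels: (batch,)                 0 = matched triplet,
                                              1 = polluted caption,
                                              2 = polluted tags/answers
    """
    # Masked Token Loss: cross-entropy on masked positions only.
    mtl = F.cross_entropy(
        token_logits.reshape(-1, token_logits.size(-1)),
        masked_labels.reshape(-1),
        ignore_index=-100,
    )
    # 3-way Contrastive Loss: classify the (caption, tags, regions) triplet.
    cl3 = F.cross_entropy(cls_logits, contrast_labels)
    return mtl + cl3
```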

Good things about the paper (one paragraph)

Github Page: https://github.com/pzzhang/VinVL

Potential future work

transfer the 3-way contrastive (triplet) loss used in OSCAR+ to our work