[TOC]

  1. Title: Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks
  2. Author: Xiujun Li et al.
  3. Publish Date: 26 Jul 2020
  4. Review Date: Sat, Sep 3, 2022

Summary of paper

Motivation

Some key terms

how self-attention transformers learn cross-modal contextualised representations
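As a rough illustration of this idea (all dimensions and weights below are toy values, not from the paper), a single self-attention head over a concatenated text + image token sequence lets every token attend to every other token, so text representations mix in image context and vice versa:

```python
import numpy as np

# Hypothetical toy dimensions, not from the paper.
d = 8                 # embedding size
n_text, n_img = 4, 3  # number of word tokens and region tokens

rng = np.random.default_rng(0)
text_emb = rng.normal(size=(n_text, d))  # word embeddings
img_emb = rng.normal(size=(n_img, d))    # region features projected to d

# Single-stream VLP: concatenate both modalities into one token sequence.
x = np.concatenate([text_emb, img_emb], axis=0)  # (n_text + n_img, d)

# One self-attention head; attention runs across the full sequence,
# which is what makes the output representations cross-modal.
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
q, k, v = x @ Wq, x @ Wk, x @ Wv
scores = q @ k.T / np.sqrt(d)
attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)
out = attn @ v  # cross-modal contextualised representations
```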

What problems VLP methods suffer from

  1. ambiguity,
    1. the visual region features are usually extracted from over-sampled regions, which inevitably results in overlaps among image regions at different positions; this makes the extracted visual embeddings ambiguous.
    2. i.e., overlap of objects in the image
  2. lack of grounding.
    1. there are no explicit label alignments between regions or objects in an image and words or phrases in the text.
    2. therefore we may want to summarise the image further so that it can be matched with the abstract words.
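To make the ambiguity point concrete, a small sketch (hypothetical boxes; the standard intersection-over-union formula) shows how two over-sampled proposals around the same object overlap heavily, so their pooled features are near-duplicates:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    return inter / (area_a + area_b - inter)

# Two over-sampled proposals around the same object overlap heavily,
# so the region features pooled from them are ambiguous.
overlap = iou((10, 10, 60, 60), (20, 20, 70, 70))  # ≈ 0.47
```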

OSCAR architecture

image-20220904135123964
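Oscar represents each image-text pair as a Word-Tag-Image triple (w, q, v), where q are object tags produced by an object detector and serve as anchor points for alignment. A minimal sketch of assembling such an input (the helper and token layout here are a hypothetical illustration, not the released code):

```python
def build_oscar_input(caption_tokens, object_tags, region_features):
    """Assemble a single-stream input: [CLS] w [SEP] q [SEP] + v.

    Words w and tags q live in the same text embedding space, which is
    what lets the tags act as anchor points between modalities; region
    features v are appended as extra visual tokens. Segment ids mark
    the text modality (0) vs. the image modality (1) as a rough sketch.
    """
    tokens = ["[CLS]"] + caption_tokens + ["[SEP]"] + object_tags + ["[SEP]"]
    segment_ids = [0] * len(tokens) + [1] * len(region_features)
    return tokens, region_features, segment_ids

tokens, regions, segs = build_oscar_input(
    ["a", "dog", "on", "a", "couch"],  # caption words w
    ["dog", "couch"],                  # detector tags q (shared vocab with w)
    [[0.1] * 2048, [0.2] * 2048],      # toy region feature vectors v
)
```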

Preprocess Region feature

Pre-training objective
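The paper pre-trains with two losses: a masked token loss over the words and tags (mask and predict, BERT-style) and a contrastive loss where the tag sequence is replaced with tags from another sample half the time and a binary head on [CLS] predicts whether the tags were polluted. A hedged sketch of the data-side logic only (function names and probabilities as labelled in the comments; not the released implementation):

```python
import random

def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Masked token loss input: mask ~15% of word/tag tokens."""
    rng = rng or random.Random(0)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append("[MASK]")
            labels.append(tok)    # prediction target at this position
        else:
            masked.append(tok)
            labels.append(None)   # no loss on unmasked positions
    return masked, labels

def pollute_tags(tags, other_tags, rng=None):
    """Contrastive loss input: swap in another sample's tags w.p. 0.5."""
    rng = rng or random.Random(1)
    if rng.random() < 0.5:
        return other_tags, 1      # label 1: tags do not match the image
    return tags, 0                # label 0: original, matching tags
```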

Results

feature visualisation

image-20220904143351769

Good things about the paper (one paragraph)

The code and pre-trained models are released: https://github.com/microsoft/Oscar