1. Title: Attention Over Learned Object Embeddings Enables Complex Visual Reasoning
  2. Author: David Ding et al.
  3. Publish Year: 2021 NeurIPS
  4. Review Date: Dec 2021

Background info for this paper:

The paper proposes an all-in-one transformer model that answers CLEVRER counterfactual questions with higher accuracy (75.6% vs. 46.5%) while using less training data (−40%)

They believe that their model relies on three key aspects:

  1. self-attention
  2. soft-discretization
  3. self-supervised learning


What is self-attention

What is soft-discretization

What is self-supervised learning

What is their result


Their assumptions to challenge

A guiding motivation for the design of Aloe is the converging evidence for the value of self-attention mechanisms operating on finite sequences of discrete entities. – The author

Our model relies on three key aspects: 1. Self-attention to effectively integrate information over time 2. … – The author

I believe this paper lacks a detailed analysis of how the attention mechanism solves the reasoning task.

To my basic understanding, the attention mechanism only extracts associational knowledge. Essentially, the attention mechanism was to permit the decoder to utilise the most relevant parts of the input sequence in a flexible manner, by a weighted combination of all of the encoded input vectors, with the most relevant vectors being attributed the highest weights. – Stefania Cristina
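The "weighted combination" described in the quote can be sketched in a few lines. This is a minimal, hypothetical illustration of dot-product attention with toy hand-picked vectors, not the actual Aloe implementation:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention(query, keys, values):
    # Relevance score between the query and each encoded input vector.
    scores = keys @ query / np.sqrt(query.shape[-1])
    # Softmax turns scores into weights; the most relevant
    # vectors are attributed the highest weights.
    weights = softmax(scores)
    # Output is a weighted combination of all encoded input vectors.
    return weights @ values, weights

# Toy example: three encoded input vectors; the query aligns with the second.
keys = np.eye(3, 4)                       # 3 inputs, dimension 4
values = np.array([[1.0] * 4, [2.0] * 4, [3.0] * 4])
query = np.array([0.0, 10.0, 0.0, 0.0])   # most similar to keys[1]
out, w = attention(query, keys, values)
print(w.argmax())  # 1: the second vector receives the highest weight
```

Note that nothing here is causal: the weights measure similarity (association) between the query and the inputs, which is the basis of the critique below.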

There is a conflict between this paper’s result and the common assumption that “statistical machine learning struggles with causality”.

There must be more to examine than simply saying “self-attention is good because it effectively integrates information over time”.