- Title: Attention Over Learned Object Embeddings Enables Complex Visual Reasoning
- Author: David Ding et al.
- Publish Year: 2021 NeurIPS
- Review Date: Dec 2021
Background info for this paper:
Their paper proposes an all-in-one transformer model (Aloe) that answers CLEVRER counterfactual questions with higher accuracy (75.6% vs 46.5%) while using 40% less training data
They believe that their model relies on three key aspects:
- self-attention
- soft-discretization
- self-supervised learning
What is self-attention
- Essentially, this means they used a Transformer architecture; a minimal sketch of self-attention follows below
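For reference, here is a minimal single-head, projection-free sketch of scaled dot-product self-attention in PyTorch. A real Transformer layer adds learned query/key/value projections, multiple heads, and feed-forward sublayers; `self_attention` is an illustrative name, not the paper's code:

```python
import torch
import torch.nn.functional as F

def self_attention(x: torch.Tensor) -> torch.Tensor:
    # x: (seq_len, d_model). For simplicity the queries, keys, and values
    # are all x itself; a real layer applies learned W_q, W_k, W_v first.
    d_model = x.size(-1)
    scores = x @ x.transpose(-2, -1) / d_model ** 0.5  # (seq_len, seq_len)
    weights = F.softmax(scores, dim=-1)                # each row sums to 1
    return weights @ x                                 # weighted mix of values
```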
What is soft-discretization
- Given the observation that self-attention modules are good at handling finite sequences of discrete entities, they want to discretise the visual information so that it fits the self-attention module
- Given the neuroscience literature suggesting that biological visual systems infer and exploit the existence of objects, rather than spatial or temporal blocks with artificial boundaries, and given that objects are the atomic units of Newtonian physics interactions, they discretise the visual input at the level of objects
- Essentially, this means they used an object detection module to encode the objects in the scene into embeddings
- The embedding $u_{it}$ records the position of object $i$ in local coordinates at time $t$ (see the sketch below for how these embeddings form the Transformer's input sequence)
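A minimal sketch of this pipeline, assuming illustrative shapes and random stand-in embeddings (the paper derives its object embeddings from a learned vision front end rather than random tensors, and processes them together with the question's tokens):

```python
import torch
import torch.nn as nn

# Hypothetical shapes: T frames, N object slots per frame, d-dim embeddings.
T, N, d = 25, 8, 64
object_embeddings = torch.randn(T, N, d)         # u[t, i]: object i at time t

# Soft-discretization: flatten the (time, object) grid into one finite
# sequence of T*N discrete tokens for the Transformer to attend over.
tokens = object_embeddings.reshape(T * N, 1, d)  # (seq_len, batch=1, d)

encoder_layer = nn.TransformerEncoderLayer(d_model=d, nhead=4)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
out = encoder(tokens)                            # contextualized object tokens
```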
What is self-supervised learning
- They mask object embeddings and train the model to infer the content of the masked object representations using its knowledge of the unmasked objects
- They designed six different masking schemes, which differ from the original BERT masking scheme because BERT's scheme only fits discrete word tokens (a sketch of one such objective follows below)
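A minimal sketch of one such masking objective, assuming random token masking and an L2 reconstruction loss; the names are illustrative and the paper's six schemes differ in exactly what they mask:

```python
import torch

def masked_object_loss(u, model, mask_prob=0.15):
    # u: (seq_len, d) flattened object embeddings; `model` maps the corrupted
    # sequence back to (seq_len, d). Random masking and L2 reconstruction are
    # assumptions here, not the paper's exact schemes.
    mask = torch.rand(u.size(0)) < mask_prob   # pick tokens to hide
    corrupted = u.clone()
    corrupted[mask] = 0.0                      # zero out the masked embeddings
    pred = model(corrupted)                    # infer the hidden content
    # Unlike BERT there is no word vocabulary to classify over, so we
    # regress the predicted embeddings onto the originals instead.
    return ((pred[mask] - u[mask]) ** 2).mean()
```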
What is their result
- As the ablation study shows, the self-attention mechanism is crucial to their model: the performance of Aloe decreases significantly without self-attention
- This suggests that the self-attention mechanism is essential for answering dynamic visual reasoning questions
Their assumptions to break
A guiding motivation for the design of Aloe is the converging evidence for the value of self-attention mechanisms operating on finite sequences of discrete entities. – The author
Our model relies on three key aspects: 1. Self-attention to effectively integrate information over time 2. … – The author
I believe this paper lacks a detailed analysis of how the attention mechanism actually solves the reasoning task.
In our basic understanding, the attention mechanism only extracts associative relationships. "Essentially, the attention mechanism was to permit the decoder to utilise the most relevant parts of the input sequence in a flexible manner, by a weighted combination of all of the encoded input vectors, with the most relevant vectors being attributed the highest weights." – Stefania Cristina
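To make that "weighted combination" concrete, a toy example (all numbers arbitrary):

```python
import torch
import torch.nn.functional as F

# One query attends over three encoded input vectors.
values = torch.tensor([[1.0, 0.0],
                       [0.0, 1.0],
                       [1.0, 1.0]])
scores = torch.tensor([2.0, 0.1, 0.5])  # relevance of each vector to the query
weights = F.softmax(scores, dim=0)      # ~[0.73, 0.11, 0.16], sums to 1
output = weights @ values               # convex combination dominated by the
                                        # most relevant encoded vector
```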
There is a conflict between this paper's results and the common assumption that "statistical machine learning struggles with causality"
There must be more to examine than simply saying "self-attention is good because it effectively integrates information over time"