[TOC]

  1. Title: Retrieve Fast, Rerank Smart: Cooperative and Joint Approaches for Improved Cross-Modal Retrieval
  2. Author: Gregor Geigle et al.
  3. Publish Date: 19 Feb 2022
  4. Review Date: Sat, Aug 27, 2022

Summary of paper

Motivation

They want to combine the advantages of the cross-encoder (CE) and the bi-encoder (BE) to make cross-modal search and retrieval more efficient:

  1. the efficiency and simplicity of the BE approach, based on twin networks
  2. the expressiveness and cutting-edge performance of CE methods.

Contribution

We propose a novel joint Cross-Encoding and Bi-Encoding model (Joint-Coop), which is trained to simultaneously cross-encode and embed multi-modal input; it achieves the highest scores overall while maintaining retrieval efficiency.


Some key terms

Bi-encoder approach

encodes images and text separately and then induces a shared high-dimensional multi-modal feature space, so candidate embeddings can be precomputed and retrieval reduces to nearest-neighbour search; this is the most common approach (see the sketch below)
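As a rough illustration (not the paper's code): once both sides are embedded, a bi-encoder scores every pair with one matrix multiplication. `image_encoder` and `text_encoder` are hypothetical stand-ins for the two embedding towers.

```python
import torch.nn.functional as F

def bi_encoder_scores(image_encoder, text_encoder, images, texts):
    # Each modality is encoded independently into the shared space.
    img_emb = F.normalize(image_encoder(images), dim=-1)  # (N_img, dim)
    txt_emb = F.normalize(text_encoder(texts), dim=-1)    # (N_txt, dim)
    # One matrix multiply gives the cosine similarity of every image-text
    # pair; embeddings can be precomputed, which makes BE retrieval fast.
    return img_emb @ txt_emb.T                            # (N_img, N_txt)
```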

Cross-attention-based approach

applies a cross-attention mechanism between examples from the two modalities to compute their similarity scores; more expressive than bi-encoding, but every query-candidate pair requires its own joint forward pass
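For contrast, a minimal sketch of cross-encoder scoring; `cross_encoder` is a hypothetical joint model returning one scalar match score per (image, text) pair:

```python
import torch

def cross_encoder_scores(cross_encoder, image, texts):
    # No precomputation is possible: each candidate needs a full joint
    # forward pass through the cross-attention model, so scoring one
    # query against N candidates costs N model calls.
    return torch.stack([cross_encoder(image, text) for text in texts])
```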


Methodology

Pretraining part

Cross-encoder training
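This note does not reproduce the paper's exact objective; the sketch below assumes the common CE training recipe of binary classification over aligned vs. misaligned pairs, with a hypothetical `cross_encoder` returning one logit per pair.

```python
import torch
import torch.nn.functional as F

def cross_encoder_step(cross_encoder, images, texts):
    # Positives: the aligned (image, caption) pairs in the batch.
    # Negatives: the same images paired with shuffled captions. (A shuffled
    # caption can occasionally land back on its own image; a careful
    # implementation would mask such collisions.)
    perm = torch.randperm(len(texts))
    pos_logits = cross_encoder(images, texts)
    neg_logits = cross_encoder(images, [texts[i] for i in perm])
    logits = torch.cat([pos_logits, neg_logits])
    labels = torch.cat([torch.ones_like(pos_logits),
                        torch.zeros_like(neg_logits)])
    return F.binary_cross_entropy_with_logits(logits, labels)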

Bi-encoding training
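Again a hedged sketch, assuming a VSE++-style max-margin triplet loss with the hardest in-batch negatives (a standard choice for BE retrieval training; the margin value here is illustrative, not taken from the paper).

```python
import torch
import torch.nn.functional as F

def be_triplet_loss(img_emb, txt_emb, margin=0.2):
    # img_emb / txt_emb: L2-normalised (B, dim) embeddings of aligned pairs.
    sims = img_emb @ txt_emb.T                   # (B, B) pair similarities
    pos = sims.diag()                            # aligned-pair similarities
    mask = torch.eye(len(sims), dtype=torch.bool, device=sims.device)
    neg_sims = sims.masked_fill(mask, float('-inf'))
    hardest_txt = neg_sims.max(dim=1).values     # hardest caption per image
    hardest_img = neg_sims.max(dim=0).values     # hardest image per caption
    loss = (F.relu(margin + hardest_txt - pos) +
            F.relu(margin + hardest_img - pos))
    return loss.mean()
```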

Bi-encoding retrieval
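Once candidate embeddings are precomputed, retrieval is just top-k over a similarity matrix; in practice an approximate-nearest-neighbour library such as FAISS would replace the exact search sketched here.

```python
def be_retrieve(query_emb, index_emb, k=20):
    # query_emb: (N_query, dim); index_emb: (N_index, dim), precomputed.
    sims = query_emb @ index_emb.T   # (N_query, N_index) similarities
    return sims.topk(k, dim=-1)      # top-k scores and candidate indices
```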

Joint-Coop process
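A sketch of the cooperative retrieve-and-rerank flow, assuming the BE head shortlists candidates and the CE head rescores only those k. In the joint model both heads share one network; they are kept as separate callables here for clarity, and all names are illustrative.

```python
import torch

def retrieve_and_rerank(query, query_emb, index_emb, candidates,
                        cross_encoder, k=20):
    sims = query_emb @ index_emb.T              # BE stage: score everything
    _, top_idx = sims.topk(k)                   # keep only k candidates
    ce_scores = torch.stack(
        [cross_encoder(query, candidates[i]) for i in top_idx])
    order = ce_scores.argsort(descending=True)  # CE stage: exact rerank
    return [candidates[top_idx[i]] for i in order]
```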

Training setup and hyperparameters


Good things about the paper (one paragraph)

Github page: https://github.com/UKPLab/MMT-Retrieval

Major comments

Minor comments

Incomprehension

Potential future work

Try this approach for computing similarity scores on multi-modal data

Also try DeepNet (DeepNorm) to stabilise very deep networks and mitigate vanishing/exploding gradients; a sketch of its residual connection follows
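A sketch of the DeepNorm residual connection from the DeepNet paper (Wang et al., 2022), using the encoder-side constant alpha = (2N)^(1/4) for an N-layer model; the matching initialisation scaling beta = (8N)^(-1/4) on the sub-layer weights is omitted for brevity.

```python
import torch.nn as nn

class DeepNormResidual(nn.Module):
    # DeepNorm: x_{l+1} = LayerNorm(alpha * x_l + sublayer(x_l))
    def __init__(self, sublayer, dim, num_layers):
        super().__init__()
        self.sublayer = sublayer
        self.alpha = (2 * num_layers) ** 0.25   # encoder-only setting
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        # Up-weighting the residual branch bounds the per-layer model
        # update, which is what stabilises very deep Transformer training.
        return self.norm(self.alpha * x + self.sublayer(x))
```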