Junnan_li Blip2 Bootstrapping Language Image Pretraining 2023

[TOC] Title: BLIP-2: Bootstrapping Language-Image Pre-training 2023 Author: Junnan Li et al. Publish Year: 15 Jun 2023 Review Date: Mon, Aug 28, 2023 url: https://arxiv.org/pdf/2301.12597.pdf Summary of paper The paper, titled “BLIP-2”, proposes a new and efficient pre-training strategy for vision-and-language models. The cost of training such models has become increasingly prohibitive due to their large scale. BLIP-2 addresses this by leveraging off-the-shelf, pre-trained image encoders and large language models (LLMs) that are kept frozen during pre-training....
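To make the frozen-backbone idea concrete, here is a minimal PyTorch sketch (my own stand-in, not the paper's Q-Former code): both pretrained backbones are frozen and only a lightweight bridge that maps visual features into the LLM's input space receives gradients. All module and dimension names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FrozenBackboneBridge(nn.Module):
    """Sketch of the BLIP-2 recipe: freeze a pretrained image encoder and a
    pretrained LLM, train only a small bridge (stand-in for the Q-Former)."""

    def __init__(self, image_encoder: nn.Module, llm: nn.Module,
                 vis_dim: int, llm_dim: int, num_query_tokens: int = 32):
        super().__init__()
        self.image_encoder = image_encoder
        self.llm = llm
        # freeze both pretrained backbones; only the bridge is trainable
        for p in self.image_encoder.parameters():
            p.requires_grad = False
        for p in self.llm.parameters():
            p.requires_grad = False
        # hypothetical bridge: learned query tokens + cross-attention + projection
        self.query_tokens = nn.Parameter(torch.randn(1, num_query_tokens, vis_dim))
        self.cross_attn = nn.MultiheadAttention(vis_dim, num_heads=8, batch_first=True)
        self.proj = nn.Linear(vis_dim, llm_dim)

    def forward(self, images, text_embeds):
        with torch.no_grad():
            vis_feats = self.image_encoder(images)        # (B, N_patches, vis_dim)
        queries = self.query_tokens.expand(vis_feats.size(0), -1, -1)
        bridged, _ = self.cross_attn(queries, vis_feats, vis_feats)
        prefix = self.proj(bridged)                        # (B, num_queries, llm_dim)
        # prepend the visual prefix and let the frozen LLM decode
        # (assumes the LLM stand-in accepts input embeddings directly)
        return self.llm(torch.cat([prefix, text_embeds], dim=1))

# toy usage with identity stand-ins for the frozen backbones:
model = FrozenBackboneBridge(nn.Identity(), nn.Identity(), vis_dim=768, llm_dim=768)
out = model(torch.randn(2, 196, 768), torch.randn(2, 16, 768))
```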

August 28, 2023 · 2 min · 327 words · Sukai Huang

Peng_gao Llama Adapter V2 2023

[TOC] Title: Llama Adapter V2 Author: Peng Gao et al. Publish Year: 28 Apr 2023 Review Date: Mon, Aug 28, 2023 url: https://arxiv.org/pdf/2304.15010.pdf Summary of paper The paper presents LLaMA-Adapter V2, an enhanced version of the original LLaMA-Adapter designed for multi-modal reasoning and instruction following. It aims to address the limitations of the original LLaMA-Adapter, which could not generalize well to open-ended visual instructions and lagged behind GPT-4 in performance....

August 28, 2023 · 2 min · 246 words · Sukai Huang
(cover image: adding an object detection pre-trained model)

Pengchuan_zhang Vinvl Revisiting Visual Representations in Vision Language Models 2021

[TOC] Title: VinVL: Revisiting Visual Representations in Vision Language Models Author: Pengchuan Zhang et al. Publish Year: 10 Mar 2021 Review Date: Sat, Sep 3, 2022 Summary of paper Motivation In our experiments we feed the visual features generated by the new object detection model into a Transformer-based VL fusion model, Oscar, and utilise an improved approach, OSCAR+, to pre-train the VL model Contribution a bigger object detection model trained on a larger amount of data, called “ResNeXt-152 C4” Some key terms Vision Language Pretraining...
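A rough PyTorch sketch of the pipeline shape described above (illustrative, not the authors' code): region features from the stronger detector are projected and concatenated with text embeddings, then fused by a Transformer encoder in the role Oscar plays. Dimensions and vocabulary size are assumptions.

```python
import torch
import torch.nn as nn

class RegionTextFusion(nn.Module):
    def __init__(self, region_dim=2048, text_vocab=30522, hidden=768, layers=6):
        super().__init__()
        self.region_proj = nn.Linear(region_dim, hidden)   # project detector region features
        self.text_embed = nn.Embedding(text_vocab, hidden)
        enc_layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=12, batch_first=True)
        self.fusion = nn.TransformerEncoder(enc_layer, num_layers=layers)

    def forward(self, region_feats, text_ids):
        # region_feats: (B, num_regions, region_dim), e.g. from a ResNeXt-152 C4 detector
        # text_ids:     (B, seq_len) token ids
        x = torch.cat([self.region_proj(region_feats), self.text_embed(text_ids)], dim=1)
        return self.fusion(x)  # joint image-text representations for downstream VL heads

model = RegionTextFusion()
out = model(torch.randn(2, 36, 2048), torch.randint(0, 30522, (2, 20)))
print(out.shape)  # torch.Size([2, 56, 768])
```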

September 3, 2022 · 2 min · 332 words · Sukai Huang
(figure: illustration of the Oscar model)

Xiujun_li Oscar Object Semantics Aligned Pre Training for Vision Language Tasks 2020

[TOC] Title: Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks Author: Xiujun Li et al. Publish Year: 26 Jul 2020 Review Date: Sat, Sep 3, 2022 Summary of paper Motivation Existing methods simply concatenate image region features (patch features) and text features as input to the model to be pre-trained, and use self-attention to learn image-text semantic alignments in a brute-force manner. The lack of explicit alignment information between the image regions and the text makes alignment modelling a weakly-supervised learning task....
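For contrast with the brute-force concatenation above, here is a minimal sketch of the Oscar-style input layout, where detected object tags are added as a third segment to anchor image-text alignment. The encoder, dimensions, and token ids are illustrative, not the released model.

```python
import torch
import torch.nn as nn

hidden = 768
text_embed = nn.Embedding(30522, hidden)      # word embeddings, shared by caption and tags
region_proj = nn.Linear(2048, hidden)         # detector region features -> hidden size

text_ids = torch.randint(0, 30522, (2, 20))   # caption tokens
tag_ids = torch.randint(0, 30522, (2, 8))     # detected object tags, e.g. "dog", "frisbee"
regions = torch.randn(2, 36, 2048)            # region features from the object detector

# one sequence: [caption ; object tags ; region features], fused with self-attention
inputs = torch.cat([text_embed(text_ids), text_embed(tag_ids), region_proj(regions)], dim=1)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=hidden, nhead=12, batch_first=True), num_layers=4)
fused = encoder(inputs)                       # (2, 20 + 8 + 36, 768)
```

The object tags appear in the same word-embedding space as the caption, which is what lets them act as anchor points between the two modalities.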

September 3, 2022 · 3 min · 462 words · Sukai Huang
(figure: learnable codebook)

Jiali_duan Multimodal Alignment Using Representation Codebook 2022

[TOC] Title: Multi-modal Alignment Using Representation Codebook Author: Jiali Duan, Liqun Chen et al. Publish Year: 2022 CVPR Review Date: Tue, Aug 9, 2022 Summary of paper Motivation Aligning signals from different modalities is an important step, as it affects the performance of later stages such as cross-modality fusion. Since image and text often reside in different regions of the feature space, directly aligning them at the instance level is challenging, especially when features are still evolving during training....
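A small illustrative sketch of the codebook idea (my own simplification, not the paper's exact objective): both modalities are compared against a shared, learnable set of prototype vectors, and alignment is encouraged between the resulting assignment distributions rather than between raw instance features.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedCodebook(nn.Module):
    def __init__(self, num_codes=1024, dim=256):
        super().__init__()
        # learnable prototype vectors shared by both modalities
        self.codes = nn.Parameter(torch.randn(num_codes, dim))

    def assign(self, feats):
        # soft assignment of features to codewords (cluster centers)
        sims = F.normalize(feats, dim=-1) @ F.normalize(self.codes, dim=-1).t()
        return sims.softmax(dim=-1)            # (B, num_codes)

codebook = SharedCodebook()
img_feat, txt_feat = torch.randn(8, 256), torch.randn(8, 256)
# one simple proxy for cross-modal alignment: match the two assignment distributions
alignment_loss = F.kl_div(codebook.assign(txt_feat).log(),
                          codebook.assign(img_feat), reduction='batchmean')
```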

August 9, 2022 · 3 min · 513 words · Sukai Huang

Deepmind Flamingo a Visual Language Model for Few Shot Learning 2022

[TOC] Title: Flamingo: a Visual Language Model for Few-Shot Learning Author: Jean-Baptiste Alayrac et al. Publish Year: Apr 2022 Review Date: May 2022 Summary of paper Flamingo architecture Pretrained vision encoder: from pixels to features. The model’s vision encoder is a pretrained Normalizer-Free ResNet (NFNet). They pretrain the vision encoder with a contrastive objective on their dataset of image-text pairs, using the two-term contrastive loss from the paper “Learning Transferable Visual Models From Natural Language Supervision”...
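The two-term contrastive loss referred to above is the symmetric image-text loss popularised by CLIP; a minimal sketch, assuming L2-normalised embeddings of N matching pairs and an illustrative temperature:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    logits = img_emb @ txt_emb.t() / temperature    # (N, N) similarity matrix
    targets = torch.arange(img_emb.size(0))          # matching pairs lie on the diagonal
    loss_i2t = F.cross_entropy(logits, targets)      # image -> text term
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image term
    return (loss_i2t + loss_t2i) / 2

img = F.normalize(torch.randn(16, 512), dim=-1)
txt = F.normalize(torch.randn(16, 512), dim=-1)
print(clip_contrastive_loss(img, txt))
```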

May 11, 2022 · 3 min · Sukai Huang

Junyang_lin M6 a Chinese Multimodal Pretrainer 2021

[TOC] Title: M6: A Chinese Multimodal Pretrainer Author: Junyang Lin et al. Publish Year: May 2021 Review Date: Jan 2022 Summary of paper This paper re-emphasises that large models trained on big data have extremely large capacity and can outperform the SOTA on downstream tasks, especially in the zero-shot setting. So the authors trained a big multi-modal model. They also proposed an innovative way to tackle downstream tasks: they use masks to block cross-attention between tokens so as to fit different types of downstream task Key idea: mask tokens during cross-attention so as to solve certain tasks Overview...
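A small sketch of the masking idea (illustrative; details differ from the paper): the same Transformer can serve different downstream task formats simply by changing which token pairs the attention mask allows. Shown here is a prefix-LM-style mask, where prefix tokens (e.g. image plus source text) attend bidirectionally while target tokens attend to the prefix and only to earlier targets.

```python
import torch

def prefix_lm_mask(num_prefix: int, num_target: int) -> torch.Tensor:
    """True = attention allowed; sizes are illustrative."""
    total = num_prefix + num_target
    allowed = torch.zeros(total, total, dtype=torch.bool)
    allowed[:num_prefix, :num_prefix] = True                 # prefix sees prefix (bidirectional)
    allowed[num_prefix:, :num_prefix] = True                 # targets see the whole prefix
    allowed[num_prefix:, num_prefix:] = torch.tril(          # targets are causal among themselves
        torch.ones(num_target, num_target, dtype=torch.bool))
    return allowed

print(prefix_lm_mask(3, 4).int())
```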

January 12, 2022 · 1 min · Sukai Huang