Haotian Liu Improved Baselines With Visual Instruction Tuning 2023

[TOC] Title: Improved Baselines With Visual Instruction Tuning Author: Haotian Liu et al. Publish Year: Oct 5 2023 Review Date: Sun, Oct 8, 2023 url: https://arxiv.org/pdf/2310.03744.pdf Summary of paper Motivation we show that the fully-connected vision-language cross-modal connector in LLaVA is surprisingly powerful and data-efficient. Contribution with simple modifications to LLaVA, namely, using CLIP-ViT with an MLP projection and adding academic-task-oriented VQA data with simple response formatting prompts, they establish stronger baselines. Some key terms Improvement one: MLP cross-modal connector ...
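A minimal sketch of the kind of two-layer MLP projector this modification refers to, mapping frozen CLIP-ViT patch features into the LLM embedding space in place of a single linear layer; the dimensions and names below are illustrative assumptions, not the official LLaVA code.

```python
import torch
import torch.nn as nn

class MLPProjector(nn.Module):
    """Two-layer MLP that projects vision features to the LLM token space."""
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim) from CLIP-ViT
        return self.proj(patch_features)  # (batch, num_patches, llm_dim)

# usage: projected visual tokens are concatenated with text token embeddings before the LLM
projector = MLPProjector()
visual_tokens = projector(torch.randn(1, 576, 1024))
```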

October 8, 2023 · 2 min · 240 words · Sukai Huang

Junnan_li BLIP Bootstrapping Language Image Pre Training for Unified Vision Language Understanding and Generation 2022

[TOC] Title: BLIP Bootstrapping Language Image Pre Training for Unified Vision Language Understanding and Generation 2022 Author: Junnan Li et al. Publish Year: 15 Feb 2022 Review Date: Mon, May 22, 2023 url: https://arxiv.org/pdf/2201.12086.pdf Summary of paper Motivation performance improvement has been largely achieved by scaling up the dataset with noisy image-text pairs collected from the web, which is a suboptimal source of supervision Contribution BLIP effectively utilises the noisy web data by bootstrapping the captions, where a captioner generates synthetic captions and a filter removes the noisy ones. Some key terms Architecture ...
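An illustrative sketch of the caption-bootstrapping idea (not the authors' code): a captioner proposes a synthetic caption for each web image and a filter keeps whichever captions it judges as matching the image. `captioner` and `filter_score` are hypothetical callables standing in for the fine-tuned captioning and image-text matching models.

```python
def bootstrap_captions(web_pairs, captioner, filter_score, threshold=0.5):
    """Return cleaned (image, caption) pairs for the next pre-training round."""
    clean_pairs = []
    for image, web_caption in web_pairs:
        synthetic_caption = captioner(image)  # generate a new caption for the web image
        for caption in (web_caption, synthetic_caption):
            if filter_score(image, caption) >= threshold:  # ITM-style match score
                clean_pairs.append((image, caption))
    return clean_pairs
```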

May 22, 2023 · 2 min · 240 words · Sukai Huang

Rohit_gridhar Imagebind One Embedding Space to Bind Them All 2023

[TOC] Title: ImageBind One Embedding Space to Bind Them All Author: Rohit Girdhar et al. Publish Year: 9 May 2023 Review Date: Mon, May 15, 2023 url: https://arxiv.org/pdf/2305.05665.pdf Summary of paper Motivation we present ImageBind, an approach to learn a joint embedding across six different modalities ImageBind can leverage recent large-scale vision-language models, and extend their zero-shot capabilities to new modalities just using their natural pairing with images. Contribution we show that all combinations of paired data are not necessary to train such a joint embedding, and only image-paired data is sufficient to bind the modalities together. Some key terms multimodality binding ...
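A hedged sketch of the binding idea: each extra modality (e.g. audio) is aligned to the image embedding space with a symmetric InfoNCE loss over image-paired data only, so alignment between non-image modalities emerges indirectly. The loss form and temperature below are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def infonce_bind(image_emb: torch.Tensor, other_emb: torch.Tensor, temperature: float = 0.07):
    # image_emb, other_emb: (batch, dim) embeddings of naturally paired samples
    image_emb = F.normalize(image_emb, dim=-1)
    other_emb = F.normalize(other_emb, dim=-1)
    logits = image_emb @ other_emb.t() / temperature   # pairwise similarities
    targets = torch.arange(image_emb.size(0))          # matching pairs lie on the diagonal
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2
```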

May 15, 2023 · 2 min · 235 words · Sukai Huang

Qinghao_hitea Hierarchical Temporal Aware Video Language Pre Training 2022

[TOC] Title: Hierarchical Temporal Aware Video Language Pre Training Author: Qinghao Ye, Fei Huang et al. Publish Year: 30 Dec 2022 Review Date: Thu, Apr 6, 2023 url: https://arxiv.org/pdf/2212.14546.pdf Summary of paper Motivation most previous methods directly inherit or adapt typical image-language pre-training paradigms to video-language pretraining, thus not fully exploiting the unique characteristic of video, i.e., temporality. Contribution this paper proposes two novel pretraining tasks for modeling cross-modal alignment between moments and texts as well as the temporal relations of video-text pairs. specifically, we propose a cross-modal moment exploration task to explore moments in videos, which results in detailed video moment representations; besides, the inherent temporal relations are captured by aligning video-text pairs as a whole at different time resolutions with a multimodal temporal relation exploration task Some key terms limitation of previous work ...
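A rough sketch, under assumptions, of what "aligning video-text pairs as a whole at different time resolutions" could look like: frame features are subsampled at several temporal strides, and each pooled view is contrasted against the text embedding. This is not the paper's implementation, just an illustration of multi-resolution video-text alignment.

```python
import torch
import torch.nn.functional as F

def multi_resolution_alignment(frame_feats, text_emb, strides=(1, 2, 4), temperature=0.07):
    # frame_feats: (batch, num_frames, dim); text_emb: (batch, dim)
    text_emb = F.normalize(text_emb, dim=-1)
    losses = []
    for s in strides:
        # coarser temporal view: keep every s-th frame, then mean-pool to one video vector
        video_emb = F.normalize(frame_feats[:, ::s, :].mean(dim=1), dim=-1)
        logits = video_emb @ text_emb.t() / temperature
        targets = torch.arange(video_emb.size(0))
        losses.append(F.cross_entropy(logits, targets))
    return torch.stack(losses).mean()
```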

April 6, 2023 · 2 min · 411 words · Sukai Huang

Anthony_liu a Picture Is Worth a Thousand Words Language Models Plan From Pixels 2023

[TOC] Title: A Picture Is Worth a Thousand Words Language Models Plan From Pixels Author: Anthony Liu et al. Publish Year: 16 Mar 2023 Review Date: Mon, Apr 3, 2023 url: https://arxiv.org/pdf/2303.09031v1.pdf Summary of paper Motivation planning is an important capability of AI agents that perform long-horizon tasks in real-world environments. prior PLM-based approaches for planning either assume observations are available in the form of text, reason about plans from the instruction alone, or incorporate information about the visual environment in limited ways. Contribution in contrast, we show that PLMs can accurately plan even when observations are directly encoded as input prompts for the PLM Some key terms why we need the ability to reason about plans ...
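A sketch under assumptions (not the paper's released code) of encoding observations directly as prompts: visual features are projected into the PLM's embedding space and prepended to the embedded instruction text, so the planner conditions on pixels rather than on a textual scene description. All dimensions and names are illustrative.

```python
import torch
import torch.nn as nn

class ObservationPrompter(nn.Module):
    """Project observation features into a few soft prompt tokens for a frozen PLM."""
    def __init__(self, obs_dim: int = 512, plm_dim: int = 768, num_prefix_tokens: int = 4):
        super().__init__()
        self.proj = nn.Linear(obs_dim, plm_dim * num_prefix_tokens)
        self.num_prefix_tokens = num_prefix_tokens
        self.plm_dim = plm_dim

    def forward(self, obs_features: torch.Tensor, text_embeddings: torch.Tensor) -> torch.Tensor:
        # obs_features: (batch, obs_dim); text_embeddings: (batch, seq, plm_dim)
        prefix = self.proj(obs_features).view(-1, self.num_prefix_tokens, self.plm_dim)
        return torch.cat([prefix, text_embeddings], dim=1)  # fed to the frozen PLM
```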

April 3, 2023 · 2 min · 359 words · Sukai Huang

Tatsuki_kuribayashi Does Vision Accelerate Hierarchical Generalisation of Neural Language Learners 2023

[TOC] Title: Does Vision Accelerate Hierarchical Generalisation of Neural Language Learners Author: Tatsuki Kuribayashi Publish Year: 1 Feb 2023 Review Date: Fri, Mar 3, 2023 url: https://arxiv.org/pdf/2302.00667.pdf Summary of paper Motivation we want to know whether visual information improves hierarchical generalisation of the language model Contribution our results have exhibited that vision accelerated a proper linguistic generalisation in the simplified, artificial setting, but LMs struggled with the proper generalisation in the noisy, realistic setting. These mixed results have indicated several possibilities; for example, an image can potentially boost language acquisition, but learners’ additional visual/linguistic **prior knowledge should be needed to** robustly make use of raw images for efficient language acquisition.

March 3, 2023 · 1 min · 111 words · Sukai Huang

Xiwen_liang Contrastive Instruction Trajectory Learning for Vision Language Navigation 2022

[TOC] Title: Contrastive Instruction Trajectory Learning for Vision Language Navigation Author: Xiwen Liang et al. Publish Year: AAAI 2022 Review Date: Fri, Feb 10, 2023 url: https://arxiv.org/abs/2112.04138 Summary of paper Motivation previous works learn to navigate step-by-step following an instruction. However, these works may fail to discriminate the similarities and discrepancies across instruction-trajectory pairs and ignore the temporal continuity of sub-instructions. These problems hinder agents from learning distinctive vision-and-language representations. Contribution we propose a coarse-grained contrastive learning objective to enhance vision-and-language representations by contrasting semantics of full trajectory observations and instructions respectively; a fine-grained contrastive learning objective to perceive instructions by leveraging the temporal information of the sub-instructions; and a pairwise sample-reweighting mechanism to mitigate sampling bias in contrastive learning. Some key terms Limitation of current VLN model ...
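An illustrative sketch (assumed form, not the paper's code) of the coarse-grained objective: full-trajectory embeddings are contrasted against full-instruction embeddings with InfoNCE, and an optional per-pair weight stands in for the sample-reweighting idea by down-weighting likely false negatives.

```python
import torch
import torch.nn.functional as F

def coarse_grained_contrast(traj_emb, instr_emb, pair_weights=None, temperature=0.1):
    # traj_emb, instr_emb: (batch, dim); row i of each is a matched trajectory-instruction pair
    traj_emb = F.normalize(traj_emb, dim=-1)
    instr_emb = F.normalize(instr_emb, dim=-1)
    logits = traj_emb @ instr_emb.t() / temperature
    if pair_weights is not None:                         # (batch, batch), 1.0 on the diagonal
        logits = logits + torch.log(pair_weights + 1e-8)  # soft re-weighting of pairs
    targets = torch.arange(traj_emb.size(0))
    return F.cross_entropy(logits, targets)
```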

February 10, 2023 · 2 min · 360 words · Sukai Huang

Zhuosheng_zhang Multimodal Chain of Thought Reasoning in Language Models 2023

[TOC] Title: Multimodal Chain of Thought Reasoning in Language Models Author: Zhuosheng Zhang et al. Publish Year: 2023 Review Date: Wed, Feb 8, 2023 url: https://arxiv.org/pdf/2302.00923.pdf Summary of paper Motivation LLMs have shown impressive performance on complex reasoning by leveraging chain-of-thought (CoT) prompting to generate intermediate reasoning chains as the rationale to infer the answer. to elicit CoT reasoning in multimodality, a possible solution is to fine-tune small language models by fusing the vision and language features to perform CoT reasoning. The key challenge is that those language models tend to generate hallucinated reasoning chains that mislead the answer inference. Contribution We propose Multimodal-CoT, which incorporates vision features in a decoupled training framework. The framework separates rationale generation and answer inference into two stages, so that the model is able to generate effective rationales that contribute to answer inference. Some key terms Multimodal-CoT ...
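A minimal sketch of the decoupled two-stage inference described above; the model objects and their `generate` methods are placeholders for the fine-tuned vision-fused language models, not the authors' API.

```python
def multimodal_cot(question, image_features, rationale_model, answer_model):
    # Stage 1: rationale generation conditioned on the text plus vision features
    rationale = rationale_model.generate(text=question, vision=image_features)
    # Stage 2: answer inference conditioned on the original input plus the rationale
    answer_input = f"{question}\nRationale: {rationale}"
    return answer_model.generate(text=answer_input, vision=image_features)
```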

February 8, 2023 · 3 min · 548 words · Sukai Huang

Jing_yu_koh Grounding Language Models to Images for Multimodal Generation 2023

[TOC] Title: Grounding Language Models to Images for Multimodal Generation Author: Jing Yu Koh et al. Publish Year: 31 Jan 2023 Review Date: Mon, Feb 6, 2023 url: https://arxiv.org/pdf/2301.13823.pdf Summary of paper Motivation we propose an efficient method to ground pre-trained text-only language models to the visual domain How we keep the language model frozen, and finetune input and output linear layers to enable cross-modality interactions. This allows our model to process arbitrarily interleaved image-and-text inputs. Contribution our approach works with any off-the-shelf language model and paves the way towards an effective, general solution for leveraging pre-trained language models in visually grounded settings. Related work LLMs for vision-and-language ...
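A hedged sketch of the grounding recipe described above, with dimensions and names as assumptions: the language model stays frozen and only small linear layers are trained, one mapping visual features into the LM's input embedding space and one mapping LM hidden states into a retrieval space shared with an image encoder.

```python
import torch
import torch.nn as nn

class GroundedLM(nn.Module):
    """Frozen language model with trainable input/output linear projections."""
    def __init__(self, lm: nn.Module, visual_dim: int = 1024, lm_dim: int = 4096, retrieval_dim: int = 256):
        super().__init__()
        self.lm = lm
        for p in self.lm.parameters():   # keep the pre-trained language model frozen
            p.requires_grad = False
        self.input_proj = nn.Linear(visual_dim, lm_dim)       # image features -> LM input tokens
        self.output_proj = nn.Linear(lm_dim, retrieval_dim)   # LM hidden states -> retrieval space
```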

February 6, 2023 · 2 min · 239 words · Sukai Huang

Zhenfang_chen See Think Confirm Interactive Prompting Between Vision and Language Models for Knowledge Based Visual Reasoning 2023

[TOC] Title: See, Think, Confirm: Interactive Prompting Between Vision and Language Models for Knowledge Based Visual Reasoning Author: Zhenfang Chen et al. Publish Year: 12 Jan 2023 Review Date: Mon, Feb 6, 2023 url: https://arxiv.org/pdf/2301.05226.pdf Summary of paper Motivation Solving knowledge-based visual reasoning tasks remains challenging, as it requires a model to comprehensively understand image content, connect external world knowledge, and perform step-by-step reasoning to answer the questions correctly. Contribution We propose a novel framework named Interactive Prompting Visual Reasoner (IPVR) for few-shot knowledge-based visual reasoning. IPVR contains three stages: see, think, and confirm. The see stage scans the image and grounds the visual concept candidates with a visual perception model. The think stage adopts a pre-trained large language model (LLM) to attend to the key concepts from the candidates adaptively. It then transforms them into text context for prompting with a visual captioning model and adopts the LLM to generate the answer. The confirm stage further uses the LLM to generate the supporting rationale for the answer, verify the generated rationale with a cross-modality classifier, and ensure that the rationale can infer the predicted output consistently. Some key terms human process to handle knowledge-based visual reasoning ...
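A high-level sketch of the see-think-confirm loop described above; every callable here (`detect`, `caption`, `verify`, the `llm` methods) is a hypothetical stand-in for the corresponding model, not an actual API.

```python
def ipvr(image, question, detect, caption, llm, verify, max_rounds=3):
    candidates = detect(image)                                  # "see": ground visual concept candidates
    for _ in range(max_rounds):
        concepts = llm.select_concepts(question, candidates)    # "think": attend to key concepts
        context = caption(image, concepts)                      # turn concepts into text context
        answer = llm.answer(question, context)
        rationale = llm.rationale(question, context, answer)    # "confirm": generate supporting rationale
        if verify(image, rationale) and llm.entails(rationale, answer):
            return answer, rationale                            # rationale checks out cross-modally
    return answer, rationale                                    # fall back to the last attempt
```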

February 6, 2023 · 2 min · 405 words · Sukai Huang

Xin_wang Reinforced Cross Modal Matching and Self Supervised Imitation Learning for Vision Language Navigation 2019

[TOC] Title: Reinforced Cross Modal Matching and Self Supervised Imitation Learning for Vision Language Navigation 2019 Author: Xin Wang et al. Publish Year: Review Date: Wed, Jan 18, 2023 Summary of paper Motivation Vision-Language Navigation (VLN) presents some unique challenges: first, reasoning over images and natural language instructions can be difficult; secondly, apart from strictly following expert demonstrations, the feedback is rather coarse, since the “Success” feedback is provided only when the agent reaches a target position (sparse reward). A good instruction-following trajectory may end up stopping just before the goal state and therefore receive zero reward. Existing work also suffers from a generalisation problem (the agent needs to be retrained in new environments). Implementation the agent can infer which sub-instruction to focus on and where to look (automatically splitting long instructions), with a matching critic that evaluates an executed path by the probability of reconstructing the original instruction from the executed path, i.e., P(original instruction | past trajectory). cycle reconstruction: we have P(target trajectory | the instruction) = 1, and we want to measure P(original instruction | past trajectory); this enhances interpretability, as we can now see how the agent interprets the instruction.
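A sketch, in an assumed form, of the cycle-reconstruction intrinsic reward described above: the matching critic scores an executed trajectory by the log-likelihood of reconstructing the original instruction from it, which can then be used alongside the sparse extrinsic success signal. `seq2seq_critic` is a hypothetical trajectory-to-instruction model.

```python
import torch

def matching_critic_reward(instruction_tokens, trajectory_feats, seq2seq_critic):
    # seq2seq_critic(trajectory_feats) returns per-position log-probabilities over the
    # vocabulary, shape (len(instruction_tokens), vocab_size), for reconstructing the instruction.
    log_probs = seq2seq_critic(trajectory_feats)
    positions = torch.arange(len(instruction_tokens))
    token_ll = log_probs[positions, instruction_tokens]   # log-prob of each ground-truth token
    return token_ll.mean()                                # ~ log P(original instruction | trajectory)
```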

January 18, 2023 · 1 min · 195 words · Sukai Huang