[TOC]

  1. Title: See, Think, Confirm: Interactive Prompting Between Vision and Language Models for Knowledge Based Visual Reasoning
  2. Author: Zhenfang Chen et al.
  3. Publish Date: 12 Jan 2023
  4. Review Date: Mon, Feb 6, 2023
  5. url: https://arxiv.org/pdf/2301.05226.pdf

Summary of paper

(figure omitted: image-20230207113442635)

Motivation

  • Solving knowledge-based visual reasoning tasks remains challenging: a model must comprehensively understand the image content, connect it to external world knowledge, and perform step-by-step reasoning to answer questions correctly.

Contribution

  • We propose a novel framework named Interactive Prompting Visual Reasoner (IPVR) for few-shot knowledge-based visual reasoning.
  • IPVR contains three stages: see, think, and confirm. The see stage scans the image and grounds the visual concept candidates with a visual perception model. The think stage adopts a pre-trained large language model (LLM) to adaptively attend to the key concepts among the candidates, transforms them into textual context for prompting with a visual captioning model, and uses the LLM to generate the answer.
  • The confirm stage further uses the LLM to generate a supporting rationale for the answer, verifies the generated rationale with a cross-modality classifier, and ensures that the rationale consistently supports the predicted output. A minimal code sketch of this three-stage loop follows this list.
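
A minimal sketch of the see-think-confirm loop as I read it from the paper; the helper callables (detect_concepts, caption_concepts, verify_rationale) and the exact prompt wording are illustrative assumptions, not the authors' released implementation.

```python
# See-think-confirm sketch. The perception model, captioner, LLM and
# cross-modality verifier are passed in as callables; their names and the
# prompt wording are assumptions for illustration.
from typing import Callable, List, Tuple


def ipvr_answer(
    image,
    question: str,
    detect_concepts: Callable[[object], List[str]],        # "see": visual perception model
    caption_concepts: Callable[[object, List[str]], str],  # concept-conditioned captioner
    llm: Callable[[str], str],                              # pre-trained LLM queried via prompts
    verify_rationale: Callable[[object, str], bool],        # "confirm": cross-modality classifier
    max_rounds: int = 3,
) -> Tuple[str, List[str]]:
    """Return the predicted answer plus the accumulated reasoning trace."""
    # See: ground candidate visual concepts in the image.
    candidates = detect_concepts(image)

    # Think: let the LLM attend to the question-relevant concepts.
    selection_prompt = (
        f"Question: {question}\n"
        f"Candidate concepts: {', '.join(candidates)}\n"
        "Concepts needed to answer the question:"
    )
    key_concepts = [c.strip() for c in llm(selection_prompt).split(",") if c.strip()]

    # Turn the selected concepts into textual context via the captioning model.
    context = caption_concepts(image, key_concepts)

    answer = ""
    trace: List[str] = []
    for _ in range(max_rounds):
        answer_prompt = (
            f"Context: {context}\n"
            + "".join(f"Rationale: {r}\n" for r in trace)
            + f"Question: {question}\nAnswer:"
        )
        answer = llm(answer_prompt).strip()

        # Confirm: generate a supporting rationale and check it against the image.
        rationale = llm(f"{answer_prompt} {answer}\nRationale:").strip()
        trace.append(rationale)
        if verify_rationale(image, rationale):
            break  # rationale is consistent with the image; keep the answer

    return answer, trace
```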

Some key terms

The human process for handling knowledge-based visual reasoning

  • (figure omitted: image-20230207115854353)

Dominant approaches to vision-and-language reasoning fall into two main categories

  • The first category adds visual perception modules that transform the visual inputs into latent inputs for LLMs, and finetunes the models on massive vision-language data.
    • However, this requires a large vision-language dataset to finetune the LLM and the new visual modules for each downstream task, which is typically computationally intensive and time-consuming.
  • The second category uses prompt-based methods for visual reasoning.
    • Images are first translated into captions, which are then used as textual prompt inputs for GPT-3 to answer the question (a minimal sketch of this baseline follows this list).
    • However, this approach has several limitations. First, the captioning process is independent of the question's semantics, so the caption covers only the image's general aspects rather than the question-related objects.
    • Second, the pipeline cannot provide a step-by-step reasoning trace, leaving question answering a black-box process.
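
For contrast, a minimal sketch of the caption-then-prompt baseline described above; the helper names and prompt wording are assumptions for illustration, not any specific released implementation.

```python
# Caption-then-prompt baseline sketch. The captioner is NOT conditioned on the
# question, which is the limitation noted above; names are illustrative.
from typing import Callable


def caption_then_prompt(
    image,
    question: str,
    caption_model: Callable[[object], str],  # generic image captioner
    llm: Callable[[str], str],               # e.g. a GPT-3-style text-completion API
) -> str:
    # The caption describes general aspects of the image, independent of the question.
    caption = caption_model(image)
    prompt = (
        "Please answer the question according to the context.\n"
        f"Context: {caption}\n"
        f"Question: {question}\n"
        "Answer:"
    )
    # Single-step answer: no intermediate reasoning trace is produced.
    return llm(prompt).strip()
```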

Method

(figures omitted: image-20230207130305270, image-20230207130634543, image-20230207131058458)

Good things about the paper (one paragraph)

  • The model gradually adds rationales to the prompt context to help the LLM output the answer (see the prompt sketch after this list).

    • The assumption is that additional context information helps the LLM predict the answer rather than disrupting the prediction.
    • (figure omitted: image-20230207131415600)
  • Compared with existing prompting methods, it not only achieves better performance but also maintains high transparency by keeping the whole trace of each reasoning step.
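
A minimal sketch of how verified rationales could be gradually appended to the prompt context before re-querying the LLM, complementing the pipeline sketch above; the prompt wording is an assumption.

```python
# Sketch: gradually appending (verified) rationales to the prompt context
# before re-querying the LLM. Prompt wording is an assumption.
from typing import Callable, List


def prompt_with_rationales(
    context: str,
    question: str,
    rationales: List[str],
    llm: Callable[[str], str],
) -> str:
    # Each rationale becomes extra context, on the assumption that more
    # (verified) context helps rather than disrupts the prediction.
    rationale_block = "".join(f"Because {r}\n" for r in rationales)
    prompt = (
        f"Context: {context}\n"
        f"{rationale_block}"
        f"Question: {question}\n"
        "Answer:"
    )
    return llm(prompt).strip()
```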