[TOC]
- Title: Towards Language Models That Can See: Computer Vision Through the LENS of Natural Language
- Author: William Berrios et al.
- Publish Year: 28 Jun 2023
- Review Date: Mon, Jul 3, 2023
- url: https://arxiv.org/pdf/2306.16410.pdf
Summary of paper
Contribution
- proposing LENS, a modular approach that addresses computer vision tasks by harnessing the few-shot, in-context learning abilities of language models through natural language descriptions of visual inputs
- LENS enables any off-the-shelf LLM to have visual capabilities without auxiliary training or data
LENS framework
- a redundant text prompt might be helpful
LENS components
- LENS consists of 3 distinct vision modules (a tag module, an attributes module, and an intensive captioner) and 1 reasoning module (a frozen LLM), each serving a specific purpose based on the task at hand.
- Prompt design
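
The step from vision-module outputs to the reasoning module's input can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the function name `build_lens_prompt`, the field labels in the template, and the sample tags, attributes, and captions are all hypothetical. In LENS the text would come from the vision modules, and the resulting string would be passed to a frozen LLM.

```python
from typing import Sequence


def build_lens_prompt(
    tags: Sequence[str],
    attributes: Sequence[str],
    captions: Sequence[str],
    question: str,
) -> str:
    """Assemble textual descriptions of an image into one LLM prompt.

    The field labels below are illustrative assumptions, not the
    exact template used in the paper.
    """
    lines = [
        "Tags: " + ", ".join(tags),
        "Attributes: " + ", ".join(attributes),
        "Captions:",
    ]
    # One caption per line; several captions may overlap in content.
    lines.extend(f"- {caption}" for caption in captions)
    lines.append(f"Question: {question}")
    lines.append("Short answer:")
    return "\n".join(lines)


# Hypothetical vision-module outputs for a single image.
prompt = build_lens_prompt(
    tags=["dog", "frisbee"],
    attributes=["brown fur", "round plastic disc"],
    captions=[
        "a dog jumping to catch a frisbee",
        "a brown dog playing in a park",
    ],
    question="What is the dog doing?",
)
print(prompt)
```

Because the image reaches the LLM only as text, any off-the-shelf LLM can consume this prompt without additional multimodal training.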
Potential future work
This paper provides a good approach for encoding an input image into a text prompt.
- we may combine this model with the approach from "Boosting Language Models Reasoning with Chain-of-Knowledge Prompting"