1. Title: Towards Language Models That Can See: Computer Vision Through the LENS of Natural Language
  2. Author: William Berrios et al.
  3. Publish Date: 28 Jun 2023
  4. Review Date: Mon, Jul 3, 2023
  5. url: https://arxiv.org/pdf/2306.16410.pdf

Summary of paper



  • The paper proposes LENS, a modular approach that tackles computer vision tasks by harnessing the few-shot, in-context learning abilities of language models through natural language descriptions of visual inputs
  • LENS enables any off-the-shelf LLM to have visual capabilities without auxiliary training or data
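
The modular idea can be illustrated with a minimal sketch. Everything named here is hypothetical: `describe_image`, `answer`, the stub vision modules, and `echo_llm` are stand-ins invented for illustration, not the paper's code or any library's API. The only point is that the vision side emits plain text, so the language model can be swapped without any retraining.

```python
from typing import Callable, Dict

# Hypothetical type aliases: a vision module maps an image path to text,
# and a language model maps a prompt string to an answer string.
VisionModule = Callable[[str], str]
LanguageModel = Callable[[str], str]


def describe_image(image_path: str, modules: Dict[str, VisionModule]) -> str:
    """Run every vision module and join the outputs as labelled text lines."""
    return "\n".join(
        f"{name}: {module(image_path)}" for name, module in modules.items()
    )


def answer(image_path: str, question: str,
           modules: Dict[str, VisionModule], llm: LanguageModel) -> str:
    """Describe the image in text, then let a frozen LLM answer the question."""
    description = describe_image(image_path, modules)
    prompt = f"{description}\nQuestion: {question}\nAnswer:"
    return llm(prompt)


# Stub modules standing in for real vision models (assumed, for illustration).
stub_modules: Dict[str, VisionModule] = {
    "Tags": lambda path: "dog, grass",
    "Caption": lambda path: "a dog running on grass",
}


def echo_llm(prompt: str) -> str:
    """Stand-in LLM that only reports how many lines the prompt had."""
    return f"received {len(prompt.splitlines())} prompt lines"


result = answer("photo.jpg", "What animal is shown?", stub_modules, echo_llm)
print(result)
```

Replacing `echo_llm` with a call to any text-only model would leave the rest of the pipeline untouched, which is the property the notes above highlight.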

LENS framework


  • a redundant text prompt (the same image described several times over, through tags, attributes, and captions) might be helpful

LENS components

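  • LENS consists of 3 distinct vision modules and 1 reasoning module, each serving a specific purpose based on the task at hand. These components are as follows:
    • Tag Module: assigns tags to the image using a vision encoder (CLIP)
    • Attributes Module: identifies attributes of the objects in the image, again using a vision encoder (CLIP)
    • Intensive Captioner: generates several captions per image with an image captioning model (BLIP)
    • Reasoning Module: a frozen LLM that generates answers from the textual descriptions produced by the vision modules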
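  • Prompt design
    • the outputs of the vision modules (tags, attributes, captions) are combined with the task question into a single text prompt for the LLM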
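
As a rough illustration of prompt design, the sketch below assembles vision-module outputs into one text prompt. The function name `build_prompt`, the section labels, and the separators are assumptions chosen for readability; they are not the paper's exact template.

```python
from typing import List


def build_prompt(tags: List[str], attributes: List[str],
                 captions: List[str], question: str) -> str:
    """Assemble a text prompt from vision-module outputs (hypothetical layout)."""
    sections = [
        ("Tags", ", ".join(tags)),
        ("Attributes", ", ".join(attributes)),
        ("Captions", " | ".join(captions)),
    ]
    # Skip empty sections so the prompt adapts to the modules used for a task.
    lines = [f"{label}: {text}" for label, text in sections if text]
    lines.append(f"Question: {question}")
    lines.append("Short Answer:")
    return "\n".join(lines)


prompt = build_prompt(
    tags=["dog"],
    attributes=["four legs", "fur"],
    captions=["a dog running on grass", "a brown dog outdoors"],
    question="What animal is in the image?",
)
print(prompt)
```

Dropping empty sections is a design choice of this sketch: a task that only needs captions would pass empty lists for the other two arguments and get a shorter prompt.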

Potential future work

How to encode an input image into a text prompt: this paper provides a good approach to build on.