1. Title: What Makes Good Examples for Visual In-Context Learning?
  2. Author: Yuanhan Zhang et al.
  3. Publish Year: 1 Feb 2023
  4. Review Date: Mon, Feb 6, 2023
  5. url: https://arxiv.org/pdf/2301.13670.pdf

Summary of paper



  • in this paper, the main focus is on an emergent ability of large vision models, known as in-context learning
  • this concept has been well-known in natural language processing but has only been studied very recently for large vision models.


  • the authors provide the first comprehensive investigation of the impact of in-context examples in computer vision, and find that performance is highly sensitive to the choice of in-context examples.
  • this exposes a critical issue: different in-context examples can lead to drastically different results.
    • the proposed prompt retrieval methods obtain significant improvements over random selection under various problem settings, showing the potential of using prompt retrieval in vision applications with a Model-as-a-Service (MaaS) business structure.
  • we show that a good in-context example should be semantically similar to the query and closer in context.
  • A model that can better balance spatial and semantic closeness in feature space would be more ideal for visual in-context learning.
    • yeah, this is because the model is not smart enough to pick up the semantics directly, regardless of what the spatial structure looks like

Some key terms

existing issue of using LLM

  • Entities able to develop large-scale models typically only provide users with APIs, known as Model-as-a-Service (MaaS). Representative examples include GPT-3. As a result, users are unable to apply full fine-tuning or parameter-efficient tuning techniques such as prompt learning for model adaptation, which largely limits downstream performance.

in-context learning

  • without the need to update any parameter for previously unseen tasks, in-context learning simply prepends some domain-specific input-output pairs, called in-context example or prompt, to a test example, which together guide the model to produce an ideal result.
  • in computer vision, we can pretrain a neural network to fill in missing patches in grid-like images, which allows the model to perform in-context learning on unseen tasks like image segmentation.
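
To make the grid-like image idea concrete, here is a minimal sketch of how a prompt canvas could be assembled. The 2x2 layout (example input and output on top, query and a masked cell below), the `MASK_VALUE` marker, and the helper names are all assumptions for illustration, not the paper's actual code; tiny nested lists stand in for real images.

```python
from typing import List

Image = List[List[int]]  # a tiny grayscale image as rows of pixel values

MASK_VALUE = -1  # hypothetical marker for pixels the model is asked to fill in


def hstack(left: Image, right: Image) -> Image:
    """Concatenate two images of equal height side by side."""
    assert len(left) == len(right), "images must have the same height"
    return [l_row + r_row for l_row, r_row in zip(left, right)]


def build_grid_prompt(example_in: Image, example_out: Image, query: Image) -> Image:
    """Assumed layout: [example input | example output] over [query | masked cell]."""
    height, width = len(query), len(query[0])
    masked = [[MASK_VALUE] * width for _ in range(height)]
    top = hstack(example_in, example_out)
    bottom = hstack(query, masked)
    return top + bottom  # stack the two rows of the grid vertically


example_in = [[1, 2], [3, 4]]
example_out = [[0, 1], [1, 1]]  # e.g. a segmentation mask for example_in
query = [[5, 6], [7, 8]]

canvas = build_grid_prompt(example_in, example_out, query)
for row in canvas:
    print(row)
```

An inpainting model pretrained on such canvases would be asked to fill the masked bottom-right cell, and that filled-in cell is read off as the prediction for the query.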

sensitivity to the prompt selection

  • choosing a good in-context example is essential for the performance


  1. Using IoU of the segmentation to rank the in-context examples
    1. or using human labelling
  2. after that, train a learnable feature extractor so that the cosine distance between the feature vectors of two similar images is small, while two dissimilar images have a large cosine distance (contrastive learning)
  3. the trained feature extractor helps to retrieve in-context examples from a large dataset.
    1. it looks like this method helps the model perform a kind of interpolation
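
The ranking and retrieval steps above can be sketched roughly as follows. This is only an illustration under assumptions: masks are flattened binary lists, the feature vectors are made-up numbers standing in for the output of a trained extractor, and the names `iou`, `cosine_similarity` and `retrieve` are hypothetical helpers rather than the authors' code. The contrastive training step itself is not shown.

```python
import math
from typing import Dict, List, Sequence


def iou(pred: Sequence[int], target: Sequence[int]) -> float:
    """Intersection over union of two flattened binary masks."""
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    return inter / union if union else 1.0  # two empty masks count as a match


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; cosine distance is 1 minus this value."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_feat: Sequence[float],
             bank: Dict[str, Sequence[float]], k: int = 1) -> List[str]:
    """Return the k candidate names whose features are closest to the query."""
    ranked = sorted(bank, key=lambda n: cosine_similarity(query_feat, bank[n]),
                    reverse=True)
    return ranked[:k]


# Step 1: rank candidate in-context examples by the IoU each one yields.
ground_truth = [1, 1, 0, 0]
predictions = {  # assumed model outputs when each candidate is used as the prompt
    "cand_a": [1, 1, 0, 0],
    "cand_b": [1, 0, 0, 0],
    "cand_c": [0, 0, 1, 1],
}
ranked_by_iou = sorted(predictions,
                       key=lambda n: iou(predictions[n], ground_truth),
                       reverse=True)
print(ranked_by_iou)  # top entries could serve as positives, bottom as negatives

# Step 3: retrieve prompts for a new query by feature similarity.
feature_bank = {"cat_a": [0.9, 0.1], "dog_a": [0.1, 0.9], "cat_b": [0.8, 0.3]}
print(retrieve([1.0, 0.2], feature_bank, k=2))
```

In this toy setup the best- and worst-ranked candidates from step 1 would play the role of similar and dissimilar pairs when training the feature extractor.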