[TOC]
- Title: Segment Anything
- Author: Alexander Kirillov et al.
- Publish Date: 5 Apr 2023
- Review Date: Sun, May 21, 2023
- url: https://arxiv.org/pdf/2304.02643.pdf
Summary of paper
Motivation
- we introduce the Segment Anything project: a new task, model, and dataset for image segmentation.
- using the model in a data-collection loop, we built the largest segmentation dataset to date.
Contribution
- the model is designed and trained to be promptable, so it can transfer zero-shot to new image distributions and tasks.
background
- CLIP and ALIGN use contrastive learning to train text and image encoders that align the two modalities.
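For reference, a minimal sketch of that contrastive objective (a symmetric InfoNCE loss over a batch of paired embeddings); this is background on CLIP/ALIGN-style alignment, not SAM's own training loss, and all names here are illustrative:

```python
import torch
import torch.nn.functional as F

def clip_style_loss(image_emb, text_emb, temperature=0.07):
    # image_emb, text_emb: (batch, dim) embeddings of paired images and captions
    image_emb = F.normalize(image_emb, dim=-1)       # move to cosine-similarity space
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature  # (batch, batch) similarity matrix
    targets = torch.arange(logits.size(0))           # matched pairs lie on the diagonal
    # symmetric loss: pull matched image/text pairs together, push mismatches apart
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = clip_style_loss(torch.randn(8, 512), torch.randn(8, 512))
```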
goal of the authors
- build a foundation model for image segmentation
- seek to develop a promptable model and pre-train it on a broad dataset
Some key terms
the plan hinges on three components
- task, model, and data
- what task will enable zero-shot generalisation
- what is the corresponding model architecture
- what data can power this task and model
promptable segmentation task
- a prompt simply specifies what to segment in an image
model for the promptable segmentation task
- the model must support flexible prompts
- and must be ambiguity-aware
Segment Anything task
- we start by translating the idea of a prompt from NLP to segmentation, where a prompt can be a set of foreground/background points, a rough box or mask, or free-form text (see the point-prompt sketch after this list)
- the requirement of a valid mask simply means that even when a prompt is ambiguous and could refer to multiple objects, the output should be a reasonable mask for at least one of those objects.
- this is similar to expecting a language model to output a coherent response to an ambiguous prompt.
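The public code release for the paper exposes this prompting interface directly; below is a minimal point-prompt sketch, assuming the segment-anything repo's predictor API (the image path and checkpoint filename are placeholders):

```python
import numpy as np
from PIL import Image
from segment_anything import sam_model_registry, SamPredictor

# load an image as an HxWx3 uint8 RGB array (path is a placeholder)
image = np.array(Image.open("example.jpg").convert("RGB"))

# checkpoint filename is a placeholder; weights are downloaded separately
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")
predictor = SamPredictor(sam)
predictor.set_image(image)  # compute the image embedding once

# a single foreground point is an ambiguous prompt; multimask_output=True
# returns several candidate masks with predicted IoU scores
masks, scores, _ = predictor.predict(
    point_coords=np.array([[500, 375]]),  # (x, y) pixel coordinates
    point_labels=np.array([1]),           # 1 = foreground, 0 = background
    multimask_output=True,
)
best_mask = masks[scores.argmax()]  # pick the highest-scoring candidate
```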
interactive segmentation
- interactive segmentation is a technique for picking objects of interest in images according to users' input interactions.
Resolving ambiguity
- with one output, the model will average multiple valid masks if given an ambiguous prompt; to address this, we modify the model to predict multiple output masks for a single prompt.
- this requires human annotation, though.
- during training, we backprop only the minimum loss over the predicted masks (see the sketch below).
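A minimal PyTorch sketch of this minimum-loss trick; the loss and shapes are illustrative (the paper actually uses a focal/dice mix over three candidate masks), not the exact training recipe:

```python
import torch
import torch.nn.functional as F

def min_over_masks_loss(pred_logits, gt_mask):
    # pred_logits: (num_masks, H, W) candidate mask logits for one prompt
    # gt_mask:     (H, W) binary ground-truth mask
    per_mask = torch.stack([
        F.binary_cross_entropy_with_logits(p, gt_mask) for p in pred_logits
    ])
    return per_mask.min()  # gradient flows only through the best candidate

preds = torch.randn(3, 64, 64, requires_grad=True)  # 3 candidates, as in SAM
gt = (torch.rand(64, 64) > 0.5).float()
loss = min_over_masks_loss(preds, gt)
loss.backward()
```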
Potential future work
- a linear classifier might be a good way to map output tokens to segmentation masks (see the sketch after this list).
- tackling the ambiguity issue further might be helpful.
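For context, the paper's mask decoder already does something close to this: an output token is mapped (via a small MLP) to a dynamic linear classifier applied over per-pixel embeddings. A minimal sketch of that idea; names and shapes here are illustrative assumptions:

```python
import torch

def token_to_mask(token, pixel_embeddings):
    # token:            (dim,) one output token from the mask decoder
    # pixel_embeddings: (dim, H, W) dense per-pixel image embedding
    d, h, w = pixel_embeddings.shape
    logits = token @ pixel_embeddings.reshape(d, h * w)  # one dot product per pixel
    return logits.reshape(h, w)                          # (H, W) mask logits

mask_logits = token_to_mask(torch.randn(256), torch.randn(256, 64, 64))
mask = mask_logits > 0  # threshold logits to a binary mask
```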