CLIP Series Notes (8): CLIPSeg

CLIPSeg: Zero/one-shot segmentation CVPR 2022

Project repository: clipseg (timojl on GitHub)

Paper year: 2022

Main capabilities:

zero-shot segmentation

one-shot segmentation

referring expression segmentation

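Summary by Kimi:

Question 1: What is novel?

The paper's main contributions are:

Flexible image segmentation system: the system produces segmentations at test time from arbitrary text or image prompts, with no retraining for new classes or complex queries.

One model for several tasks: a single unified model handles three common segmentation tasks: referring expression segmentation, zero-shot segmentation, and one-shot segmentation.

Extension of CLIP: a pretrained CLIP model serves as the backbone, and a Transformer-based decoder is added on top of it for dense prediction.

Mixed input modalities: the model accepts either text or an image as the prompt, so it can adapt dynamically to any binary segmentation task that can be phrased as a text or image query.

Adaptability to complex queries: the model generalizes well to queries involving affordances or attributes.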

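Question 2: Which datasets are used for training and evaluation?

The paper uses the following datasets for training and evaluation:

PhraseCut: used to train and evaluate referring expression segmentation. The paper also extends PhraseCut with visual support samples and negative samples; the result is called PhraseCut+ (PC+).

Pascal-VOC: used to evaluate zero-shot segmentation, with settings of 2 to 10 unseen classes.

COCO-20i: used to evaluate one-shot segmentation.

LVIS test set: used to evaluate how well the model handles generalized queries involving affordances or attributes.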

Example using transformers:

Official Jupyter notebook:

CLIPSeg prediction mechanism
We employ a generic binary prediction setting, where a foreground that matches the prompt has to be differentiated from background. This binary setting can be adapted to multi-label predictions which is needed by Pascal zero-shot segmentation.
Contributions
Our main technical contribution is the CLIPSeg model, which extends the well-known CLIP transformer for zero-shot and one-shot segmentation tasks by proposing a lightweight transformer-based decoder. A key novelty of this model is that the segmentation target can be specified by different modalities: through text or an image.
This allows us to train a unified model for several benchmarks. For text-based queries, unlike networks trained on PhraseCut, our model is able to generalize to new queries involving unseen words. For image-based queries, we explore various forms of visual prompt engineering – analogously to text prompt engineering in language modeling. Furthermore, we evaluate how our model generalizes to novel forms of prompts involving affordances.

Basic concepts

Zero-Shot Segmentation

text + image → segmentation

In zero-shot segmentation the goal is to segment objects of categories that have not been seen during training. Normally, multiple classes need to be segmented in an image at the same time. In the generalized setting, both seen and unseen categories may occur.

One-Shot Segmentation

image(sample) + mask(sample) → engineered_visual_prompt

engineered_visual_prompt + image(query) → segmentation

In one-shot semantic segmentation, the model is provided at test time with a single example of a certain class, usually as an image with a corresponding mask. One-shot semantic segmentation is a comparably new task.

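Technical details

A frozen CLIP ViT-B/16 serves as the backbone, and a small, parameter-efficient Transformer decoder is added on top of it.

The decoder is trained on additional data to perform the segmentation task, while the CLIP encoder's parameters stay fixed throughout training.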

→ Decoder Architecture

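Although it is a Transformer decoder, it borrows the skip-connection idea from U-Net. When a query image is passed through the CLIP vision Transformer, the activations of selected layers are read out and projected to the token embedding size D of the new decoder. These extracted activations (including the CLS token) are added to the decoder's internal activations before each Transformer block. The decoder has as many Transformer blocks as there are extracted activations.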

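The decoder then applies a linear layer to turn the Transformer output into a binary segmentation, i.e. $\mathbb{R}^{(1 + \frac{W}{P} \times \frac{H}{P}) \times D} \mapsto \mathbb{R}^{W \times H}$, where $P$ is the patch size of CLIP.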
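As a quick sanity check on the shapes involved, the sketch below counts the tokens the decoder works on: one CLS token plus one token per image patch. The 352×352 input size is an assumption chosen for illustration, and the function name is hypothetical; P = 16 is the patch size of ViT-B/16.

```python
def decoder_token_count(width: int, height: int, patch_size: int = 16) -> int:
    """Number of tokens: one CLS token plus one token per patch."""
    if width % patch_size or height % patch_size:
        raise ValueError("image sides must be multiples of the patch size")
    return 1 + (width // patch_size) * (height // patch_size)


# Assumed example: a 352x352 input with P = 16.
tokens = decoder_token_count(352, 352)
patch_tokens = tokens - 1

# One reading of the mapping (an assumption): the CLS token is dropped and
# each patch token is expanded to P x P mask pixels.
assert patch_tokens * 16 * 16 == 352 * 352
print(tokens)
```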

To inform the decoder about the segmentation target, the authors modulate the decoder's input activations with FiLM, using a conditional vector. This conditional vector can be obtained in two ways:

(1) using the CLIP text-transformer embedding of a text query and

(2) using the CLIP visual transformer on a feature engineered prompt image.
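The control flow of the decoder described above can be sketched roughly as follows. This is only a schematic under stated assumptions: tokens are flattened into plain lists, `reduces`, `blocks` and `film` are hypothetical stand-ins for the projection layers, the Transformer blocks and the FiLM modulation, and the order in which the extracted layers are consumed is left to the caller.

```python
from typing import Callable, List, Optional, Sequence

Vector = List[float]


def run_decoder(
    clip_activations: Sequence[Vector],
    cond: Vector,
    reduces: Sequence[Callable[[Vector], Vector]],
    blocks: Sequence[Callable[[Vector], Vector]],
    film: Callable[[Vector, Vector], Vector],
) -> Vector:
    """Skip connections are added before each block; FiLM is applied once."""
    assert len(clip_activations) == len(reduces) == len(blocks)
    state: Optional[Vector] = None
    for i, (act, reduce, block) in enumerate(zip(clip_activations, reduces, blocks)):
        projected = reduce(act)  # project CLIP activations to size D
        if state is None:
            state = projected
        else:
            state = [p + s for p, s in zip(projected, state)]  # U-Net-style skip
        if i == 0:
            # Assumption: the condition modulates the decoder input only.
            state = film(state, cond)
        state = block(state)  # stand-in for a Transformer block
    assert state is not None
    return state


# Toy usage with identity layers and a multiplicative "FiLM".
identity = lambda v: list(v)
scale = lambda state, cond: [s * c for s, c in zip(state, cond)]
print(run_decoder([[1.0, 2.0], [10.0, 20.0]], [2.0, 3.0],
                  [identity, identity], [identity, identity], scale))
```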

→ → FiLM

https://distill.pub/2018/feature-wise-transformations/


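In essence, FiLM applies a feature-wise affine transformation to the input, with the scaling and the shifting both determined by the condition. It is a common trick and is also used in Latte. Put differently, the condition passes through two separate layers, and their outputs act on the input as a multiplicative coefficient and an additive constant. In Latte this happens as a step preceding the attention layer.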
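A dependency-free sketch of this scale-and-shift, written in plain Python rather than PyTorch. The two linear layers, their weights and all names here are hypothetical; only the structure (two layers on the condition, one giving the coefficient and one the constant) follows the description above.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def linear(weight: Matrix, bias: Vector, x: Vector) -> Vector:
    """Plain affine layer y = W x + b."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def film(features: Matrix, cond: Vector,
         w_mul: Matrix, b_mul: Vector,
         w_add: Matrix, b_add: Vector) -> Matrix:
    """Scale and shift every token channel-wise with vectors derived from cond."""
    gamma = linear(w_mul, b_mul, cond)  # multiplicative coefficient per channel
    beta = linear(w_add, b_add, cond)   # additive constant per channel
    return [[g * f + b for g, f, b in zip(gamma, token, beta)] for token in features]


# Toy example: 2 tokens with D = 2 channels and a condition of size 3.
tokens = [[1.0, 2.0], [3.0, 4.0]]
cond = [1.0, 0.0, -1.0]
w_mul, b_mul = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]], [1.0, 0.0]  # gamma = [2, -1]
w_add, b_add = [[0.0, 1.0, 0.0], [0.0, 0.0, 0.0]], [0.5, 0.0]  # beta = [0.5, 0]
print(film(tokens, cond, w_mul, b_mul, w_add, b_add))
```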

From the code: clipseg.py (timojl)

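The positional embeddings of the original CLIP are learned for a fixed grid, so the input image size has to be fixed as well. Here the positional embeddings are interpolated so that other image sizes can be used (see clipseg.py in timojl's repository).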

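This interpolation is not learned: the stored embeddings are simply resampled to the new grid and then added to the tokens as usual.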
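The resampling idea can be sketched as follows. This is a plain-Python illustration that uses bilinear interpolation as a stand-in; the repository may well use a different interpolation mode, and the function name and data layout (CLS embedding first, then patches in row-major order) are assumptions.

```python
from typing import List

Vector = List[float]


def resize_pos_embedding(pos_emb: List[Vector], old_grid: int, new_grid: int) -> List[Vector]:
    """Resample patch position embeddings from old_grid^2 to new_grid^2 entries.

    pos_emb[0] is the CLS embedding and is kept unchanged; the remaining
    old_grid * old_grid rows are the patch embeddings in row-major order.
    """
    cls, patches = pos_emb[0], pos_emb[1:]
    assert len(patches) == old_grid * old_grid
    dim = len(cls)

    def source_coord(i: int) -> float:
        # Map a target index to a source coordinate so that corners line up.
        return 0.0 if new_grid == 1 else i * (old_grid - 1) / (new_grid - 1)

    out = [cls]
    for r in range(new_grid):
        y = source_coord(r)
        y0 = int(y)
        y1 = min(y0 + 1, old_grid - 1)
        wy = y - y0
        for c in range(new_grid):
            x = source_coord(c)
            x0 = int(x)
            x1 = min(x0 + 1, old_grid - 1)
            wx = x - x0
            p00, p01 = patches[y0 * old_grid + x0], patches[y0 * old_grid + x1]
            p10, p11 = patches[y1 * old_grid + x0], patches[y1 * old_grid + x1]
            out.append([
                (1 - wy) * ((1 - wx) * p00[d] + wx * p01[d])
                + wy * ((1 - wx) * p10[d] + wx * p11[d])
                for d in range(dim)
            ])
    return out


# Toy example: a 2x2 grid of 1-dimensional embeddings resampled to 3x3.
small = [[9.0], [0.0], [1.0], [2.0], [3.0]]
print(resize_pos_embedding(small, old_grid=2, new_grid=3))
```

No parameters are trained here; the new grid is computed purely from the stored embeddings.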

In our experiments we use CLIP ViT-B/16 with a patch size P of 16 and use a projection dimension of D=64 if not indicated otherwise. We extract CLIP activations at layers S = [3, 7, 9], consequently our decoder has only three layers.

This looks like a typo in the paper: the layers should be [3, 6, 9], which is what the code uses.

Our model receives information about the segmentation target (“what to segment?”) through a conditional vector. This can be provided either by text or an image (through visual prompt engineering). Since CLIP uses a shared embedding space for images and text captions, we can interpolate between both in the embedding space and condition on the interpolated vector. Formally, let $s_i$ be the embedding of the support image and $t_i$ the text embedding of a sample $i$, we obtain a conditional vector $x_i$ by a linear interpolation $x_i = a s_i + (1 - a) t_i$, where $a$ is sampled uniformly from [0, 1].

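[Data augmentation on the condition] The image embedding and the text embedding are linearly interpolated in the shared embedding space, which unifies the two modalities into a single conditional vector. The conditional vector can therefore come from an image, from text, or from a mix of both.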
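A minimal sketch of this augmentation: the conditional vector is a random convex combination a·s + (1 − a)·t of the support-image embedding s and the text embedding t. The function name is hypothetical, and the embeddings are plain lists standing in for CLIP outputs.

```python
import random
from typing import List, Optional


def mix_condition(support_emb: List[float], text_emb: List[float],
                  rng: Optional[random.Random] = None) -> List[float]:
    """Return a * s + (1 - a) * t with a drawn uniformly from [0, 1)."""
    assert len(support_emb) == len(text_emb)
    a = (rng or random).random()
    return [a * s + (1.0 - a) * t for s, t in zip(support_emb, text_emb)]


# Toy embeddings; real ones would come from the CLIP image and text encoders.
x = mix_condition([1.0, 0.0], [0.0, 1.0], random.Random(0))
print(x)
```

Every mixed vector lies on the line segment between the two embeddings, so the decoder sees conditions ranging from purely visual to purely textual.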

Training details

→ Dataset construction

We use the PhraseCut dataset [20], which encompasses over 340,000 phrases with corresponding image segmentations. Originally, this dataset does not contain visual support but only phrases and for every phrase a corresponding object exists. We extend this dataset in two ways: visual support samples and negative samples.

https://github.com/ChenyunWu/PhraseCutDataset

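What is visual support?

Visual support is the image fed in as the condition, i.e. the reference image that shows what to segment, not the image that is actually being segmented.

To find a visual support for a prompt, all samples in the dataset that share this prompt are collected and one of them is drawn. If only one image corresponds to the prompt, the text prompt alone is used and no support image is paired with it.

What are negative samples?

They play the same role as negative pairs in CLIP. Here they are built simply by pairing an image with a prompt that matches nothing in that image.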
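The two extensions can be sketched on a toy list of (image, phrase) records. Everything below is a hypothetical illustration: the record layout, the function names, and the choice to exclude the query image itself from its own support candidates are assumptions, not the authors' data pipeline.

```python
import random
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

Sample = Tuple[str, str]  # (image_id, phrase); hypothetical minimal record


def index_by_phrase(samples: List[Sample]) -> Dict[str, List[str]]:
    """Group image ids by the phrase they are annotated with."""
    index: Dict[str, List[str]] = defaultdict(list)
    for image_id, phrase in samples:
        index[phrase].append(image_id)
    return index


def pick_support(sample: Sample, index: Dict[str, List[str]],
                 rng: random.Random) -> Optional[str]:
    """Pick another image sharing the phrase, or None if the phrase is unique."""
    image_id, phrase = sample
    candidates = [i for i in index[phrase] if i != image_id]
    return rng.choice(candidates) if candidates else None


def make_negative(sample: Sample, samples: List[Sample],
                  rng: random.Random) -> Sample:
    """Swap in a phrase that is not annotated for this image."""
    image_id, _ = sample
    own = {p for i, p in samples if i == image_id}
    others = sorted({p for _, p in samples} - own)
    return image_id, rng.choice(others)  # presumably paired with an empty mask


samples = [("a", "red car"), ("b", "red car"), ("c", "dog")]
index = index_by_phrase(samples)
rng = random.Random(0)
print(pick_support(samples[0], index, rng))   # another "red car" image
print(pick_support(samples[2], index, rng))   # unique phrase: text prompt only
print(make_negative(samples[2], samples, rng))
```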

4 Visual Prompt Engineering

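[Prompt Engineering]

First, as the CLIP paper points out, prompt engineering works: in classification, turning a bare class name into "a photo of xxx" gives better results. How should the visual side be engineered, then? The paper proposes what it calls visual prompt engineering.

To find a better way of presenting the visual support at inference time, the analysis is based on a probability difference, and several prompting methods are compared.

The question is how the mask should be applied to the image so that it has the greatest effect when CLIPSeg is used for object classification (inference).

Processing the image in different ways while keeping the text unchanged gives very different results in CLIP. The systematic analysis uses the difference between the leftmost and the rightmost probability in the figure.

The paper notes that in CNN-based networks a global image feature can be obtained with global average pooling, but that this does not carry over to Transformer-based models. The well-known [CLS] token is therefore used to represent the global feature of an input image. Cosine similarity is again used to measure how well image and text match, and this similarity is the score shown in Figure 3. Each visual_highlighting variant corresponds to a different way of highlighting the object in the image.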
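A minimal sketch of one highlighting step and of the alignment score. Darkening the background is picked here as an illustrative variant only; the function names and the `bg_intensity` default are hypothetical, and the embeddings would come from CLIP's image encoder (CLS token) and text encoder, which are not called here.

```python
import math
from typing import List

Image = List[List[float]]  # grayscale image with values in [0, 1]
Mask = List[List[int]]     # 1 = object, 0 = background


def darken_background(image: Image, mask: Mask, bg_intensity: float = 0.1) -> Image:
    """Keep object pixels and scale background pixels by bg_intensity."""
    return [[px if m else px * bg_intensity for px, m in zip(img_row, mask_row)]
            for img_row, mask_row in zip(image, mask)]


def cosine_similarity(u: List[float], v: List[float]) -> float:
    """Alignment score between an image embedding and a text embedding."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


# Toy image: the object occupies the left pixel only.
print(darken_background([[1.0, 0.5]], [[1, 0]], bg_intensity=0.5))
# Toy embeddings standing in for the CLS token and the text embedding.
print(cosine_similarity([1.0, 0.0], [1.0, 1.0]))
```

Comparing the score of the highlighted image with the score of the original image indicates whether a given highlighting variant steers CLIP toward the target object.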

Experiments

Metrics

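Our models are trained to generate binary predictions that indicate where objects matching the query are located. If necessary, this binary setting can be transformed into a multi-label setting.

Foreground IoU: IoU computed on the foreground only

mean IoU: the average of the foreground IoU over the different classes

binary IoU: the average of the foreground IoU and the background IoU

Binary IoU requires a threshold t. Many works naturally set t = 0.5; the paper runs an experiment on this choice.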
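The three metrics and the role of the threshold t can be sketched per mask as follows. This is a simplified illustration on flat lists: the function names are hypothetical, and benchmark implementations may aggregate intersections and unions over a whole dataset rather than per image.

```python
from typing import List, Sequence


def iou(pred: Sequence[bool], target: Sequence[bool]) -> float:
    """Intersection over union of two flat boolean masks (1.0 if both are empty)."""
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    return inter / union if union else 1.0


def binarize(probs: Sequence[float], t: float = 0.5) -> List[bool]:
    """Turn per-pixel probabilities into a mask using threshold t."""
    return [p > t for p in probs]


def foreground_iou(probs: Sequence[float], target: Sequence[bool], t: float = 0.5) -> float:
    return iou(binarize(probs, t), target)


def binary_iou(probs: Sequence[float], target: Sequence[bool], t: float = 0.5) -> float:
    """Average of the foreground IoU and the background IoU."""
    pred = binarize(probs, t)
    fg = iou(pred, target)
    bg = iou([not p for p in pred], [not x for x in target])
    return (fg + bg) / 2


def mean_iou(per_class_fg_iou: Sequence[float]) -> float:
    """Average of the foreground IoU over classes."""
    return sum(per_class_fg_iou) / len(per_class_fg_iou)


probs = [0.9, 0.8, 0.2, 0.1]
target = [True, False, False, False]
print(foreground_iou(probs, target, t=0.5), foreground_iou(probs, target, t=0.85))
```

In this toy case raising the threshold removes a false positive and changes the foreground IoU, which is why t is worth tuning instead of fixing at 0.5.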

Models and Baselines

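CLIP-Deconv: uses the CLIP encoder with a very basic decoder that consists only of the essential parts: FiLM, a linear layer, and a deconvolution layer.

ViTSeg: has the same architecture as CLIPSeg but replaces the CLIP image encoder with an ImageNet-pretrained vision transformer. The text encoder is the same as CLIP's.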