Vision Transformer

Introduction

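Since the deep learning boom, CNNs have been the dominant models in computer vision and have delivered strong results, while the self-attention-based Transformer has excelled in NLP. Although the Transformer had become the standard architecture in NLP, its use in computer vision was still very limited.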

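ViT (Vision Transformer), proposed by Google in 2020, applies the Transformer directly to image classification. In the paper's experiments the best model reaches 88.55% accuracy on ImageNet-1K (after pre-training on Google's in-house JFT dataset), which shows that the Transformer is indeed effective in computer vision, and strikingly so.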

The Embedding layer in detail

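As the figure shows, the Transformer Encoder reads the image as a sequence of small patches. Each patch is treated as a token (comparable to a character or word in NLP), and the Transformer computes the correlation between every pair of tokens. This is very different from a convolutional neural network. A CNN downsamples step by step with convolution plus pooling, so in theory the receptive field can be enlarged by making the network deeper. This approach has two drawbacks:

First, in practice CNNs respond weakly near the image borders. This is easy to understand: the closer a pixel is to the border, the fewer convolution windows cover it, so it contributes less during gradient updates.

Second, a CNN can only relate neighboring pixels. Because convolution is a sliding window, pixels outside the same neighborhood are never processed together; for example, a pixel in the top-left corner is never convolved jointly with a pixel in the bottom-right corner. Some spatial information therefore goes unused. In addition, as the MAE paper points out, natural images are redundant: adjacent pixels carry nearly the same information, so looking only at local neighborhoods does not make the best use of the image features.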

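Back to ViT: splitting the image into patches is not enough. The Transformer Encoder expects a sequence of token vectors with shape [num_token, token_dim], whereas image data has shape [H, W, C], which does not fit. The Embedding layer performs this conversion from image data to tokens. Take ViT-B/16 as an example: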

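Assume the input image is 224x224x3 and each token covers a 16x16x3 region of the original image. The image is then split into (224/16)^2 = 196 patches. Each patch is flattened into 16 * 16 * 3 = 768 values and linearly projected to the embedding dimension, which for ViT-B/16 is also 768. Stacking the 196 tokens gives a tensor of shape [196, 768].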
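The arithmetic above can be checked with a few lines of Python. This is only a sketch; the helper name `patch_shapes` is made up for illustration and does not come from any ViT codebase.

```python
def patch_shapes(image_size=224, patch_size=16, in_c=3):
    """Return (num_patches, flattened_patch_dim) for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    grid = image_size // patch_size              # patches per side: 14
    num_patches = grid * grid                    # sequence length: 196
    patch_dim = patch_size * patch_size * in_c   # values in one flattened patch: 768
    return num_patches, patch_dim


print(patch_shapes())  # (196, 768)
```

The second number is the size of a raw flattened patch. It matches the embedding dimension only because ViT-B/16 happens to use 768 for both.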

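In the code, the patches are not cropped explicitly. A convolution whose kernel size and stride both equal patch_size does the splitting and the linear projection in one step, producing a grid of image_size // patch_size patches per side: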

import torch.nn as nn


class PatchEmbed(nn.Module):
    """
    2D Image to Patch Embedding
    """

    def __init__(self, image_size=224, patch_size=16, in_c=3, embed_dim=768, norm_layer=None):
        """
        Map input tensor to patch.
        Args:
            image_size: input image size
            patch_size: patch size
            in_c: number of input channels
            embed_dim: embedding dimension (768 for ViT-B/16, equal to patch_size * patch_size * in_c)
            norm_layer: The function of normalization
        """
        super().__init__()
        image_size = (image_size, image_size)
        patch_size = (patch_size, patch_size)
        self.image_size = image_size
        self.patch_size = patch_size
        self.grid_size = (image_size[0] // patch_size[0], image_size[1] // patch_size[1])
        self.num_patches = self.grid_size[0] * self.grid_size[1]

        # Kernel size == stride == patch_size: each output position sees exactly one patch
        self.proj = nn.Conv2d(in_c, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.norm = norm_layer(embed_dim) if norm_layer else nn.Identity()

    def forward(self, x):
        B, C, H, W = x.shape
        assert H == self.image_size[0] and W == self.image_size[1], \
            f"Input image size ({H}*{W}) doesn't match model ({self.image_size[0]}*{self.image_size[1]})."

        # proj:      [B, in_c, H, W] -> [B, embed_dim, grid_h, grid_w]
        # flatten:   [B, embed_dim, grid_h, grid_w] -> [B, embed_dim, num_patches]
        # transpose: [B, embed_dim, num_patches] -> [B, num_patches, embed_dim]
        x = self.proj(x).flatten(2).transpose(1, 2)
        x = self.norm(x)
        return x
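To see why a convolution with kernel size and stride both equal to the patch size amounts to "cut patches, flatten, apply one linear map", here is a small pure-Python sketch on a 4x4 single-channel image with 2x2 patches. The helper names `extract_patches` and `strided_conv` are illustrative, the sketch assumes the image size is divisible by the patch size, and the bias term and multiple output channels are left out for brevity.

```python
def extract_patches(image, p):
    """Split an H x W nested list into non-overlapping p x p patches (row-major), each flattened."""
    h, w = len(image), len(image[0])
    patches = []
    for top in range(0, h, p):
        for left in range(0, w, p):
            patches.append([image[top + r][left + c] for r in range(p) for c in range(p)])
    return patches


def strided_conv(image, kernel, p):
    """Single-channel convolution with kernel size p and stride p, no padding."""
    h, w = len(image), len(image[0])
    out = []
    for top in range(0, h - p + 1, p):
        for left in range(0, w - p + 1, p):
            out.append(sum(image[top + r][left + c] * kernel[r][c]
                           for r in range(p) for c in range(p)))
    return out


image = [[r * 4 + c for c in range(4)] for r in range(4)]
kernel = [[1, 0], [0, -1]]                      # one output channel
flat_kernel = [v for row in kernel for v in row]

# "Linear layer" view: dot product of each flattened patch with the flattened kernel
linear = [sum(a * b for a, b in zip(patch, flat_kernel))
          for patch in extract_patches(image, 2)]
conv = strided_conv(image, kernel, 2)

print(conv == linear)  # True
```

Each of the embed_dim output channels of the convolution plays the role of one row of the linear projection matrix.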

Note also that position encodings can be added to the tokens.

Position encoding for Vision Transformer

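In a 2021 paper, researchers from Sun Yat-sen University and Microsoft Research Asia carried out a series of studies on self-attention position encoding for computer vision and obtained notable results. For that reason, this section does not cover the position encoding from the original Vision Transformer paper and records their findings instead.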

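First, the motivation of the paper. Since the Transformer was proposed in 2017, it has been remarkably successful in NLP (natural language processing) and has become the base architecture for tasks such as machine translation. Following that success, many researchers have tried to apply it to computer vision in recent years. Because the input sequences differ, however, it has remained an open question whether relative position encoding, which is widely used in NLP, is equally suitable for computer vision. The paper cites a number of recent works that reach conflicting conclusions, so it is unclear whether relative position encoding actually helps in vision. In view of this, the authors propose and implement four relative position encoding methods, plug them into the Vision Transformer, train them under the same settings, and compare them with the original absolute position encoding. The comparison covers image classification and object detection.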

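Open questions about relative position encoding:

Does relative position encoding still work for computer vision when it is extended from 1D to 2D?
Is the relative direction between patches an important feature in relative position encoding?
Which works better, bias mode or contextual mode?
To which of the three embeddings, Q, K or V, should the relative position information be added?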

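To answer the first three questions, the authors propose and implement eight relative position encoding variants and compare them. The eight variants consist of four in bias mode and four in contextual mode; within each group of four, two take relative direction into account and two do not. For the fourth question, the authors reuse the same eight variants and change where the position information is added.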

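Overview of the basic scheme:

Since all eight variants are relative position encodings, they have a large part in common. To make the mechanism easier to follow, this section first outlines the shared framework and workflow before introducing the eight variants.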

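Two position mapping functions:

$$\mathrm{clip}(x, k) = \max(-k, \min(k, x))$$

$$g(x) = \begin{cases} [x], & |x| \le \alpha \\ \mathrm{sign}(x) \times \min\left(\beta, \left[\alpha + \frac{\ln(|x|/\alpha)}{\ln(\gamma/\alpha)} (\beta - \alpha)\right]\right), & |x| > \alpha \end{cases}$$

In g(x), [·] denotes rounding and sign(x) is a step function: sign(x) = -1 when x < 0 and sign(x) = 1 when x > 0. α is the point where the function switches from the linear piece to the logarithmic piece, β is the maximum output value, and γ controls the curvature of the logarithmic part.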
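A minimal sketch of the two mapping functions in plain Python. It assumes g(x) is the piecewise function from the paper: rounded identity for |x| ≤ α, logarithmic growth beyond α, capped at β. The name `piecewise_index` and the values α = 2, β = 4, γ = 8 are chosen here purely for illustration, and α is assumed to be positive.

```python
import math


def clip(x, k):
    """Clamp x to the range [-k, k]."""
    return max(-k, min(k, x))


def piecewise_index(x, alpha, beta, gamma):
    """Sketch of g(x): fine-grained near zero, log-compressed far away, capped at +/-beta."""
    if abs(x) <= alpha:
        return round(x)
    sign = 1 if x > 0 else -1
    scaled = alpha + math.log(abs(x) / alpha) / math.log(gamma / alpha) * (beta - alpha)
    return sign * min(beta, round(scaled))


# Nearby offsets keep distinct indices; distant offsets collapse into a few buckets
for x in (-100, -3, -1, 0, 1, 3, 6, 100):
    print(x, clip(x, 4), piecewise_index(x, alpha=2, beta=4, gamma=8))
```

Compared with clip, which stays linear all the way to the cap, the logarithmic branch reaches the cap more slowly, so a wider range of distances is covered by the same number of buckets.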

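In computer vision tasks we assume that patches that are close together are more strongly correlated. In addition, for high-resolution images the number of patches is very large, and storing separate relative information for every pair would make the cost grow quadratically, which works against keeping the model simple. The encoding therefore concentrates on distinguishing positions that are close to each other. The relative horizontal and vertical distances between two patches are mapped by the functions above to integers in [-β, β], and the result is used as an index into a relative position lookup table. In contextual mode each table entry is a vector with the same dimension as the model; in bias mode it is a learnable scalar. Details follow below. Apart from the difference between contextual and bias mode, the four methods within each mode differ only in how the argument x of g(x) or clip(x, k) is computed.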
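A rough count shows why bucketing helps. The sketch below assumes a square grid of patches and one index per axis in [-β, β]; the function name `relative_position_counts` and the value β = 4 are illustrative only.

```python
def relative_position_counts(grid, beta):
    """Count token pairs, distinct 2D offsets, and per-axis buckets for a grid x grid layout."""
    tokens = grid * grid
    pairs = tokens * tokens                  # every (i, j) pair of patches
    distinct_offsets = (2 * grid - 1) ** 2   # each axis offset lies in [-(grid-1), grid-1]
    buckets_per_axis = 2 * beta + 1          # indices in [-beta, beta]
    return pairs, distinct_offsets, buckets_per_axis


print(relative_position_counts(grid=14, beta=4))  # (38416, 729, 9)
```

For the 14x14 grid of ViT-B/16, the number of pairs and of distinct offsets grows with the image size, while the number of buckets depends only on β.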

Bias Mode and Contextual Mode:

Previous relative position encoding methods all depend on input embeddings. It brings a question, i.e., whether the encoding can be independent of the input? We introduce bias mode and contextual mode of relative position encoding to study the question.

The former one is independent of input embeddings, while the latter one considers the interaction with queries, keys or values. More specifically, we introduce a unified formulation as

$$e_{ij} = \frac{(x_i W^Q)(x_j W^K)^T + b_{ij}}{\sqrt{d_z}}$$

where $b_{ij} \in \mathbb{R}$ is the 2D relative position encoding, defining the bias or contextual mode. For bias mode,

$$b_{ij} = r_{ij}$$

where $r_{ij} \in \mathbb{R}$ is a learnable scalar and represents the relative position weight between the position i and j. For contextual mode,

$$b_{ij} = (x_i W^Q)\, r_{ij}^T$$

where $r_{ij} \in \mathbb{R}^{d_z}$ is a trainable vector, interacted with the query embedding. There are multiple variants for $b_{ij}$ in contextual mode. For example, the relative position encoding operated on both queries and keys can be presented as

$$b_{ij} = (x_i W^Q)(r^K_{ij})^T + (x_j W^K)(r^Q_{ij})^T$$

where $r^K_{ij}, r^Q_{ij} \in \mathbb{R}^{d_z}$ are both learnable vectors. Besides, contextual mode can also be applied on value embeddings,

$$z_i = \sum_{j=1}^{n} \alpha_{ij}\,(x_j W^V + r^V_{ij})$$

where $r^V_{ij} \in \mathbb{R}^{d_z}$. The relative position weights $r^Q_{ij}$, $r^K_{ij}$ and $r^V_{ij}$ can be constructed in the same way. For a unified representation, we use $r_{ij}$ to denote them in bias mode and contextual mode in the following discussion.
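The difference between the two modes can be sketched for a single pair of tokens. The snippet assumes the unified form e_ij = (q · k + b_ij) / sqrt(d_z), where q = x_i W^Q and k = x_j W^K are already projected; bias mode sets b_ij to a scalar r_ij, and contextual mode (query variant) sets b_ij to the dot product of q with a vector r_ij. All numbers and function names are made up for illustration.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention_logit(q, k, b_ij):
    """e_ij = (q . k + b_ij) / sqrt(d_z) for one query/key pair."""
    return (dot(q, k) + b_ij) / math.sqrt(len(q))


def bias_mode(r_scalar):
    """Bias mode: the encoding is a learnable scalar, independent of the input."""
    return r_scalar


def contextual_mode(q, r_vector):
    """Contextual mode: the encoding interacts with the query embedding."""
    return dot(q, r_vector)


q = [1.0, 2.0, 0.0, 1.0]          # hypothetical projected query, d_z = 4
k = [0.5, 0.0, 1.0, 1.0]          # hypothetical projected key
r_scalar = 0.25                   # hypothetical bucket entry in bias mode
r_vector = [0.5, 0.0, 0.0, 0.0]   # hypothetical bucket entry in contextual mode

e_bias = attention_logit(q, k, bias_mode(r_scalar))
e_ctx = attention_logit(q, k, contextual_mode(q, r_vector))
print(e_bias, e_ctx)  # 0.875 1.0
```

In bias mode the added term is the same for every input image, whereas in contextual mode it changes with the query, at the cost of storing a d_z-dimensional vector per bucket.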

Two undirected relative position encoding methods:

We propose two undirected mapping methods, namely Euclidean and Quantization, as well as two directed mapping methods, namely Cross and Product.

Euclidean method.

On image plane, the relative position $(\tilde{x}_i - \tilde{x}_j, \tilde{y}_i - \tilde{y}_j)$ is a 2D coordinate. We compute Euclidean distance between two positions, and map the distance into the corresponding encoding. The method is undirected and formulated as

$$r_{ij} = p_{I(i,j)}$$

$$I(i,j) = g\left(\sqrt{(\tilde{x}_i - \tilde{x}_j)^2 + (\tilde{y}_i - \tilde{y}_j)^2}\right)$$

where $p_{I(i,j)}$ is either a learnable scalar in bias mode or a vector in contextual mode. We regard $p_{I(i,j)}$ as a bucket, which stores the relative position weight. The number of buckets is 2β + 1.

Quantization method.

In the above Euclidean method, the closer two neighbors with different relative distances may be mapped into the same index, e.g. the 2D relative positions (1, 0) and (1, 1) are both mapped into the index 1. We suppose that the close neighbors should be separated. Therefore, we quantize Euclidean distance, i.e., different real number is mapped into different integer.

$$I(i,j) = g\left(\mathrm{quant}\left(\sqrt{(\tilde{x}_i - \tilde{x}_j)^2 + (\tilde{y}_i - \tilde{y}_j)^2}\right)\right)$$

The operation quant(·) maps a set of real numbers {0, 1, 1.41, 2, 2.24, ...} into a set of integers {0, 1, 2, 3, 4, ...}. This method is also undirected.
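The contrast between the two undirected methods can be sketched as follows. The snippet assumes that g is the piecewise function described earlier (rounded identity up to α, logarithmic beyond, capped at β) and that quant(·) ranks the distances that can occur on the grid. Ranking squared distances gives the same order as ranking the distances themselves and keeps the keys as integers. Function names and the values α = 2, β = 4, γ = 8 are illustrative.

```python
import math


def piecewise_index(x, alpha, beta, gamma):
    """Sketch of g(x); alpha is assumed to be positive."""
    if abs(x) <= alpha:
        return round(x)
    sign = 1 if x > 0 else -1
    scaled = alpha + math.log(abs(x) / alpha) / math.log(gamma / alpha) * (beta - alpha)
    return sign * min(beta, round(scaled))


def euclidean_index(dx, dy, alpha, beta, gamma):
    """Euclidean method: index from the real-valued distance."""
    return piecewise_index(math.sqrt(dx * dx + dy * dy), alpha, beta, gamma)


def build_quant_table(grid):
    """Map every achievable squared distance on a grid x grid layout to its rank."""
    squared = sorted({dx * dx + dy * dy for dx in range(grid) for dy in range(grid)})
    return {d2: rank for rank, d2 in enumerate(squared)}


def quantization_index(dx, dy, table, alpha, beta, gamma):
    """Quantization method: index from the rank of the distance."""
    return piecewise_index(table[dx * dx + dy * dy], alpha, beta, gamma)


table = build_quant_table(4)
for offset in [(1, 0), (1, 1)]:
    print(offset,
          euclidean_index(*offset, 2, 4, 8),
          quantization_index(*offset, table, 2, 4, 8))
```

With these settings (1, 0) and (1, 1) share index 1 under the Euclidean method but receive indices 1 and 2 under Quantization, which is the separation of close neighbors that the text describes.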

Two directed relative position encoding methods:

Cross method.

Positional direction of pixels is also important for images, we thereby propose directed mapping methods. This method is called Cross method, which computes encoding on horizontal and vertical directions separately, then summarizes them. The method is given as follows,

$$r_{ij} = p^{\tilde{x}}_{I^{\tilde{x}}(i,j)} + p^{\tilde{y}}_{I^{\tilde{y}}(i,j)} \tag{22}$$

$$I^{\tilde{x}}(i,j) = g(\tilde{x}_i - \tilde{x}_j) \tag{23}$$

$$I^{\tilde{y}}(i,j) = g(\tilde{y}_i - \tilde{y}_j) \tag{24}$$

where $p^{\tilde{x}}_{I^{\tilde{x}}(i,j)}$ and $p^{\tilde{y}}_{I^{\tilde{y}}(i,j)}$ are both learnable scalars in bias mode, or learnable vectors in contextual mode. Similar to the encoding in SASA [17], the same offsets on x-axis or y-axis share the same encoding, but the main difference is that we use a piecewise function to distribute attention by relative distance. The number of buckets is 2 × (2β + 1).
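A small sketch of the Cross lookup in bias mode. To stay short it uses clip(x, β) as the index mapping instead of the piecewise g(x), which is a simplification, and it shifts indices by β so they can address a Python list. The table contents are arbitrary placeholder numbers standing in for learned scalars.

```python
def clip(x, k):
    return max(-k, min(k, x))


def cross_lookup(dx, dy, table_x, table_y, beta):
    """Cross method: sum of one entry per axis, each indexed by the signed offset."""
    ix = clip(dx, beta) + beta   # shift [-beta, beta] to [0, 2*beta]
    iy = clip(dy, beta) + beta
    return table_x[ix] + table_y[iy]


beta = 2
table_x = [10, 20, 30, 40, 50]   # 2*beta + 1 horizontal buckets (placeholders)
table_y = [1, 2, 3, 4, 5]        # 2*beta + 1 vertical buckets (placeholders)

print(cross_lookup(1, -1, table_x, table_y, beta))   # 42
print(cross_lookup(1, 1, table_x, table_y, beta))    # 44
print(len(table_x) + len(table_y))                   # 10 buckets = 2 * (2*beta + 1)
```

The two lookups above share the horizontal entry 40 because their x offsets are equal, which is the sharing along one axis that the passage mentions.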

Product method.

The Cross method encodes different relative positions into the same embedding if the distance on one direction is identical, either horizontal or vertical. Besides, the addition operation in Eq. (22) brings extra computational cost. To improve efficiency and involve more directional information, we design Product method which is formulated as below

$$r_{ij} = p_{I^{\tilde{x}}(i,j),\, I^{\tilde{y}}(i,j)}$$

The right side of the equation is a trainable scalar in bias mode, or a trainable vector in contextual mode. $I^{\tilde{x}}(i,j)$ and $I^{\tilde{y}}(i,j)$ are defined in Eq. (23) and Eq. (24), and the combination of them is a 2D index for p. The number of buckets is (2β + 1)^2.
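A matching sketch of the Product lookup in bias mode. As a simplification it again uses clip(x, β) for the per-axis index rather than the piecewise g(x), and the table is filled with unique placeholder ids so that collisions would be easy to spot.

```python
def clip(x, k):
    return max(-k, min(k, x))


def product_lookup(dx, dy, table, beta):
    """Product method: one entry addressed by the pair (x index, y index)."""
    ix = clip(dx, beta) + beta   # shift [-beta, beta] to [0, 2*beta]
    iy = clip(dy, beta) + beta
    return table[ix][iy]


beta = 2
side = 2 * beta + 1
# Placeholder table: entry value is a unique id, standing in for a learned scalar
table = [[row * side + col for col in range(side)] for row in range(side)]

print(product_lookup(1, -1, table, beta))   # 16
print(product_lookup(1, 1, table, beta))    # 18
print(sum(len(row) for row in table))       # 25 buckets = (2*beta + 1) ** 2
```

Every (x, y) index pair has its own entry and no addition is needed at lookup time, at the price of a table that grows quadratically in β.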