Vision Transformer
Introduction
The Embedding Layer in Detail
Position Encoding for Vision Transformers
Open questions about relative position encoding:
- Does relative position encoding extended from 1D to 2D work for computer vision?
- Is the relative directionality between groups of image patches an important feature in relative position encoding?
- Which works better, Bias Mode or Contextual Mode?
- In relative position encoding, which of the query, key, and value should the relative position information be added to?
A brief overview of the basic scheme:
- clip(x, k) = max(−k, min(k, x))
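To make the clipping behaviour concrete, here is a minimal Python sketch (the function name and the choice k = 2 are illustrative, not from the original scheme): every relative distance beyond ±k collapses onto the boundary, so only 2k + 1 distinct indices remain.

```python
def clip(x: int, k: int) -> int:
    # clip(x, k) = max(-k, min(k, x)): limit a relative distance to [-k, k]
    return max(-k, min(k, x))

# With k = 2, distances beyond +/-2 all share the boundary index.
print([clip(d, 2) for d in range(-5, 6)])
# -> [-2, -2, -2, -2, -1, 0, 1, 2, 2, 2, 2]
```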
Bias Mode and Contextual Mode:
Two undirected relative position encoding methods:
Euclidean method:
Quantization method.
In the above Euclidean method, close neighbors with different relative distances may be mapped to the same index; e.g., the 2D relative positions (1, 0) and (1, 1) are both mapped to index 1. We suppose that such close neighbors should be separated. Therefore, we quantize the Euclidean distance, i.e., different real numbers are mapped to different integers.
The operation quant(·) maps a set of real numbers {0, 1, 1.41, 2, 2.24, ...} into a set of integers {0, 1, 2, 3, 4, ...}. This method is also undirected.
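As a rough illustration of the two undirected mappings, the Python sketch below rounds the 2D Euclidean distance to obtain the Euclidean-method index (so (1, 0) and (1, 1) collide at 1) and ranks the distinct distances to obtain the quantized index. The rounding step, the `max_offset` bound, and the helper names are assumptions made for illustration; the paper additionally passes the distance through its piecewise index function g(·).

```python
import math

def euclidean_index(dx: int, dy: int) -> int:
    # Euclidean method (sketch): round the 2D Euclidean distance to the nearest
    # integer, so (1, 0) -> 1 and (1, 1) -> round(1.41) = 1 share index 1.
    return round(math.sqrt(dx * dx + dy * dy))

def build_quant_table(max_offset: int) -> dict:
    # Quantization method (sketch): collect every distinct distance reachable
    # within the offset range and map it to its rank, i.e.
    # {0, 1, 1.41, 2, 2.24, ...} -> {0, 1, 2, 3, 4, ...}.
    dists = sorted({round(math.sqrt(dx * dx + dy * dy), 6)
                    for dx in range(-max_offset, max_offset + 1)
                    for dy in range(-max_offset, max_offset + 1)})
    return {d: i for i, d in enumerate(dists)}

quant_table = build_quant_table(max_offset=2)

def quantized_index(dx: int, dy: int) -> int:
    return quant_table[round(math.sqrt(dx * dx + dy * dy), 6)]

# (1, 0) and (1, 1) collide under the Euclidean method but not after quantization.
print(euclidean_index(1, 0), euclidean_index(1, 1))   # 1 1
print(quantized_index(1, 0), quantized_index(1, 1))   # 1 2
```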
Two directed relative position encoding methods:
Cross method.
The positional direction of pixels is also important for images; we thereby propose directed mapping methods. The first is the Cross method, which computes the encoding on the horizontal and vertical directions separately and then sums the two.
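A PyTorch-style sketch of the idea, under stated assumptions: two per-axis embedding tables are indexed by the clipped horizontal and vertical offsets and their lookups are summed. The hard clip to [-2, 2], the table size, and the function names are illustrative; the paper maps offsets with its own piecewise index function rather than a plain clip.

```python
import torch

k = 2                          # clipping range per axis (assumption)
num_buckets = 2 * k + 1        # 5 possible bucket values per axis
dim = 8                        # embedding width, illustrative

# One learnable table per axis; the final relative encoding is their sum.
px = torch.nn.Embedding(num_buckets, dim)
py = torch.nn.Embedding(num_buckets, dim)

def clip(d: int, bound: int = k) -> int:
    return max(-bound, min(bound, d))

def cross_encoding(pos_i, pos_j):
    # Cross method (sketch): index horizontal and vertical offsets separately,
    # look each up in its own table, then add the two embeddings.
    bx = clip(pos_i[0] - pos_j[0]) + k   # shift to a non-negative index
    by = clip(pos_i[1] - pos_j[1]) + k
    return px(torch.tensor(bx)) + py(torch.tensor(by))

r = cross_encoding((3, 1), (1, 2))   # relative offset (2, -1)
print(r.shape)                       # torch.Size([8])
```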
Product method.
The Cross method encodes different relative positions into the same embedding if the distance along one direction, either horizontal or vertical, is identical. Besides, the addition operation in Eq. (22) brings extra computational cost. To improve efficiency and involve more directional information, we design the Product method.
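For contrast, a sketch of the Product idea under the same illustrative assumptions: the horizontal and vertical buckets are fused into a single index into one larger table, so every (dx, dy) pair receives its own embedding, and a single lookup replaces the two lookups plus addition of the Cross method.

```python
import torch

k = 2                              # clipping range per axis (assumption)
buckets_per_axis = 2 * k + 1       # 5 bucket values per axis
dim = 8                            # embedding width, illustrative

# One table indexed by the (dx bucket, dy bucket) pair: 5 * 5 = 25 entries.
pxy = torch.nn.Embedding(buckets_per_axis ** 2, dim)

def clip(d: int, bound: int = k) -> int:
    return max(-bound, min(bound, d))

def product_encoding(pos_i, pos_j):
    # Product method (sketch): fuse the two axis buckets into one index, so each
    # (horizontal, vertical) offset pair has a distinct embedding and only one
    # lookup (no addition) is required.
    bx = clip(pos_i[0] - pos_j[0]) + k
    by = clip(pos_i[1] - pos_j[1]) + k
    return pxy(torch.tensor(bx * buckets_per_axis + by))

r = product_encoding((3, 1), (1, 2))   # relative offset (2, -1)
print(r.shape)                         # torch.Size([8])
```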