The Intro

Attention Is All You Need | arxiv

Pytorch Transformers from Scratch (Attention is all you need) | Aladdin Persson | YouTube

TRANSFORMERS FROM SCRATCH | blog

transformer_from_scratch.py | GitHub

Attention and Q,K,V

Queries, Keys, and Values are terms from the field of Recommendation Algorithms.

There is a collection of many key-value pairs $D = \{(k_1,v_1),(k_2,v_2),\cdots,(k_m,v_m)\}$ and a query $q$. You need to find a key-value pair $(k_?,v_?)$ that is best for your query.

$$
\begin{split}
\text{Attention}(q,D) &\overset{\text{def}}{=} \sum_{i=1}^m \alpha(q,k_i)v_i\\
&= \begin{bmatrix}\alpha(q,k_1) & \cdots & \alpha(q,k_m)\end{bmatrix} \begin{bmatrix}v_1 \\ \vdots \\ v_m\end{bmatrix}
\end{split}
$$

where $\alpha(q,k_i)\in\mathbb{R}$ are scalar attention weights. The operation itself is typically referred to as attention pooling. The name attention derives from the fact that the operation pays particular attention to the terms for which the weight $\alpha$ is significant (i.e., large).
As such, the attention over $D$ generates a linear combination of the values contained in the database.

We can apply the softmax operation to $[\alpha(q,k_1)\cdots\alpha(q,k_m)]$ to make the weights nonnegative and sum to 1.

$$
\text{Attention}(q,D) = \text{softmax}\left(\begin{bmatrix}\alpha(q,k_1) & \cdots & \alpha(q,k_m)\end{bmatrix}\right) \begin{bmatrix}v_1 \\ \vdots \\ v_m\end{bmatrix}
$$

In particular, when $q$, $k_i$, $v_i$ are all (row) vectors and the function $\alpha(\cdot,\cdot)$ is the vector dot product, we get dot-product attention:

$$
\begin{split}
\text{Attention}(q,D) &= \text{softmax}\left(q\cdot [k^T_1,\cdots, k^T_m]\right) \begin{bmatrix}v_1 \\ \vdots \\ v_m\end{bmatrix}\\
&= \text{softmax}(qK^T)V
\end{split}
$$
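As a quick illustration (my own sketch, not taken from the references above), here is dot-product attention pooling over a toy key-value collection in PyTorch: the query scores every key, softmax normalizes the scores into weights, and the output is the weighted combination of the values.

```python
import torch

# A toy key-value "database" with m pairs: keys k_1..k_m and values v_1..v_m.
m, d_k, d_v = 5, 4, 3
q = torch.randn(1, d_k)    # one query as a row vector
K = torch.randn(m, d_k)    # keys stacked as rows
V = torch.randn(m, d_v)    # values stacked as rows

weights = torch.softmax(q @ K.T, dim=-1)   # softmax([q·k_1^T, ..., q·k_m^T]) -> shape (1, m)
output = weights @ V                       # linear combination of the values -> shape (1, d_v)
print(weights.sum().item(), output.shape)  # 1.0  torch.Size([1, 3])
```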

Self-Attention

The “self” in “self-attention” means that there is no external collection of key-value pairs and no external query; instead, the queries, keys and values all come from the input itself.

Input Embedding

$$
\{x_1, x_2, \cdots, x_n\} \xrightarrow[\text{Input Embedding}]{f(\cdot)} \{a_1, a_2, \cdots, a_n\}
$$

$x_1,\cdots,x_n$ is the original input sequence, and $a_1,\cdots,a_n$ is the sequence obtained by linearly embedding $x_1,\cdots,x_n$ into a higher dimension. Their elements are all row vectors.

Q,K,V

$$
Q = \underset{n\times d_q}{\begin{bmatrix} q_1 \\ q_2 \\ \vdots \\ q_n \end{bmatrix}} = \underset{n\times d_a}{\begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \end{bmatrix}} \cdot \underset{d_a\times d_q}{W^q}, \quad
K = \underset{n\times d_k}{\begin{bmatrix} k_1 \\ k_2 \\ \vdots \\ k_n \end{bmatrix}} = \underset{n\times d_a}{\begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \end{bmatrix}} \cdot \underset{d_a\times d_k}{W^k}, \quad
V = \underset{n\times d_v}{\begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{bmatrix}} = \underset{n\times d_a}{\begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \end{bmatrix}} \cdot \underset{d_a\times d_v}{W^v}
$$

$$
d_q = d_k \overset{?}{=} d_v
$$

$d_q$ must equal $d_k$ so that the dot products $q_ik_j^T$ are defined, while $d_v$ does not have to equal them.

Attention

$$
\text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$$

Note: $d_k$ is the dimension of each $k_{1\cdots n}$, not the dimension of $K$, which is $n\times d_k$.

$$
QK^T = \begin{bmatrix} q_1 \\ q_2 \\ \vdots \\ q_n \end{bmatrix} \cdot
\begin{bmatrix} k_1^T & k_2^T & \cdots & k_n^T \end{bmatrix} =
\begin{bmatrix}
q_1k_1^T & q_1k_2^T & \cdots & q_1k_n^T\\
q_2k_1^T & q_2k_2^T & \cdots & q_2k_n^T\\
\vdots & \vdots & \ddots & \vdots\\
q_nk_1^T & q_nk_2^T & \cdots & q_nk_n^T
\end{bmatrix}
$$

$$
\text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) = \begin{bmatrix}
\text{softmax}\left(\begin{bmatrix}\frac{q_1k_1^T}{\sqrt{d_k}} & \frac{q_1k_2^T}{\sqrt{d_k}} & \cdots & \frac{q_1k_n^T}{\sqrt{d_k}}\end{bmatrix}\right)\\
\text{softmax}\left(\begin{bmatrix}\frac{q_2k_1^T}{\sqrt{d_k}} & \frac{q_2k_2^T}{\sqrt{d_k}} & \cdots & \frac{q_2k_n^T}{\sqrt{d_k}}\end{bmatrix}\right)\\
\vdots\\
\text{softmax}\left(\begin{bmatrix}\frac{q_nk_1^T}{\sqrt{d_k}} & \frac{q_nk_2^T}{\sqrt{d_k}} & \cdots & \frac{q_nk_n^T}{\sqrt{d_k}}\end{bmatrix}\right)
\end{bmatrix}
$$

$$
\text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V
$$
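As a small sanity check (my own sketch), the softmax in the matrix form is applied to each row of $QK^T/\sqrt{d_k}$, which in PyTorch corresponds to `softmax(..., dim=-1)`:

```python
import torch

# The softmax in the matrix formula acts row-wise:
# row i holds the weights that query q_i assigns to k_1, ..., k_n.
n, d_k = 4, 8
Q, K = torch.randn(n, d_k), torch.randn(n, d_k)

scores = Q @ K.T / d_k ** 0.5            # (n, n)
weights = torch.softmax(scores, dim=-1)  # dim=-1 means softmax over each row
print(weights.sum(dim=-1))               # tensor of ones: every row sums to 1
```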

Multi-head Self Attention

$$
\begin{gathered}
\underset{n\times d_{model}}{Q} = \underset{n\times d_{model}}{\begin{bmatrix} q_1 \\ q_2 \\ \vdots \\ q_n \end{bmatrix}} = \underset{n\times d_a}{\begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \end{bmatrix}} \cdot \underset{d_a\times d_{model}}{W^q}\\
\underset{n\times d_{model}}{K} = \underset{n\times d_{model}}{\begin{bmatrix} k_1 \\ k_2 \\ \vdots \\ k_n \end{bmatrix}} = \underset{n\times d_a}{\begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \end{bmatrix}} \cdot \underset{d_a\times d_{model}}{W^k}\\
\underset{n\times d_{model}}{V} = \underset{n\times d_{model}}{\begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{bmatrix}} = \underset{n\times d_a}{\begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \end{bmatrix}} \cdot \underset{d_a\times d_{model}}{W^v}
\end{gathered}
$$

Let $d_q=d_k=d_v=d_{model}/h$.

$$
\begin{gathered}
\text{MultiHead}(Q,K,V) = \text{Concat}(\text{head}_1,\cdots,\text{head}_h)W^O\\
\text{head}_i = \text{Attention}(QW^Q_i, KW^K_i, VW^V_i)
\end{gathered}
$$

In the paper Attention Is All You Need, the authors employed $d_{model} = 512$, $h = 8$ and $d_q = d_k = d_v = d_{model}/h = 64$.

$$
\begin{split}
Q &\rightarrow QW^Q_1,QW^Q_2,\cdots,QW^Q_h\\
K &\rightarrow KW^K_1,KW^K_2,\cdots,KW^K_h\\
V &\rightarrow VW^V_1,VW^V_2,\cdots,VW^V_h
\end{split}
$$

$W^Q_1 \cdots W^Q_8$, $W^K_1 \cdots W^K_8$ and $W^V_1 \cdots W^V_8$ are matrices like the following:

$$
\begin{array}{c}
\begin{matrix} 1 \\ 2 \\ \vdots \\ 64 \\ \\ \vdots \\ 512 \end{matrix}
\begin{bmatrix}
\textcolor{red}{1} & 0 & \cdots & 0\\
0 & \textcolor{red}{1} & \cdots & 0\\
\vdots & \vdots & \ddots & \vdots\\
0 & 0 & \cdots & \textcolor{red}{1}\\
0 & 0 & \cdots & 0\\
\vdots & \vdots && \vdots\\
0 & 0 & \cdots & 0
\end{bmatrix}
\end{array}, \quad
\begin{array}{c}
\begin{matrix} 1 \\ \vdots \\ \\ 65 \\ 66 \\ \vdots \\ 128 \\ \\ \vdots \\ 512 \end{matrix}
\begin{bmatrix}
0 & 0 & \cdots & 0\\
\vdots & \vdots && \vdots\\
0 & 0 & \cdots & 0\\
\textcolor{red}{1} & 0 & \cdots & 0\\
0 & \textcolor{red}{1} & \cdots & 0\\
\vdots & \vdots & \ddots & \vdots\\
0 & 0 & \cdots & \textcolor{red}{1}\\
0 & 0 & \cdots & 0\\
\vdots & \vdots && \vdots\\
0 & 0 & \cdots & 0
\end{bmatrix}
\end{array}, \quad \cdots, \quad
\begin{array}{c}
\begin{matrix} 1 \\ \vdots \\ \\ 448 \\ 449 \\ \vdots \\ 512 \end{matrix}
\begin{bmatrix}
0 & 0 & \cdots & 0\\
\vdots & \vdots && \vdots\\
0 & 0 & \cdots & 0\\
\textcolor{red}{1} & 0 & \cdots & 0\\
0 & \textcolor{red}{1} & \cdots & 0\\
\vdots & \vdots & \ddots & \vdots\\
0 & 0 & \cdots & \textcolor{red}{1}
\end{bmatrix}
\end{array}
$$

In this way, $\underset{n\times 512}{Q}, \underset{n\times 512}{K}, \underset{n\times 512}{V}$ are split uniformly along their columns into 8 parts.

$$
\begin{split}
\underset{n\times 512}{Q} &\xrightarrow{W_1\cdots W_8} \underset{n\times 64}{Q_1},\cdots, \underset{n\times 64}{Q_8}\\
\underset{n\times 512}{K} &\xrightarrow{W_1\cdots W_8} \underset{n\times 64}{K_1},\cdots, \underset{n\times 64}{K_8}\\
\underset{n\times 512}{V} &\xrightarrow{W_1\cdots W_8} \underset{n\times 64}{V_1},\cdots, \underset{n\times 64}{V_8}
\end{split}
$$

In particular, we have

$$
\begin{split}
\underset{n\times 512}{Q} &= \text{concat}[\underset{n\times 64}{Q_1}\cdots \underset{n\times 64}{Q_8}]\\
\underset{n\times 512}{K} &= \text{concat}[\underset{n\times 64}{K_1}\cdots \underset{n\times 64}{K_8}]\\
\underset{n\times 512}{V} &= \text{concat}[\underset{n\times 64}{V_1}\cdots \underset{n\times 64}{V_8}]
\end{split}
$$
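A small sketch (my own check, using the paper's $d_{model}=512$, $h=8$): multiplying $Q$ by such a 0/1 selection matrix is the same as slicing out one block of 64 columns, which is exactly what the `reshape` into heads does in the implementation later in this post.

```python
import torch

# d_model = 512, h = 8 heads, 64 columns per head (as in the paper).
n, d_model, h = 3, 512, 8
d = d_model // h
Q = torch.randn(n, d_model)

W1 = torch.zeros(d_model, d)
W1[:d] = torch.eye(d)                     # the first 0/1 selection matrix W_1

Q1_by_matmul = Q @ W1                     # (n, 64): multiply by the selection matrix
Q1_by_slice = Q[:, :d]                    # (n, 64): just take the first 64 columns
Q1_by_reshape = Q.reshape(n, h, d)[:, 0]  # (n, 64): head 0 after a reshape into heads

print(torch.allclose(Q1_by_matmul, Q1_by_slice),
      torch.allclose(Q1_by_slice, Q1_by_reshape))  # True True
```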

What’s the difference between Multi-head and normal Attention

In both cases we obtain $Q,K,V$ from the input sequence $x_1,\cdots,x_n$ by the same linear embedding. Then we split $Q,K,V$ into several parts to get $Q_1\cdots Q_h$, $K_1\cdots K_h$, $V_1\cdots V_h$.
In particular, when we use special parameter matrices such as those above, we have the special case:

$$
\begin{split}
Q &= \text{concat}[Q_1\cdots Q_h]\\
K &= \text{concat}[K_1\cdots K_h]\\
V &= \text{concat}[V_1\cdots V_h]
\end{split}
$$

How to calculate normal Attention

$$
\text{Attention}(Q,K,V) = \text{Attention}(\text{concat}[Q_1\cdots Q_h],\text{concat}[K_1\cdots K_h],\text{concat}[V_1\cdots V_h])
$$

How to calculate multi-head Attention

$$
\text{MultiHead}(Q,K,V) = \text{concat}[\text{Attention}(Q_1,K_1,V_1),\cdots, \text{Attention}(Q_h,K_h,V_h)]
$$

Comparison of the computation

$$
\begin{split}
Q &= \text{concat}[Q_1\cdots Q_h]\\
K &= \text{concat}[K_1\cdots K_h]\\
V &= \text{concat}[V_1\cdots V_h]
\end{split}
$$

$$
\begin{split}
QK^TV &= \begin{bmatrix}Q_1 & Q_2 & \cdots & Q_h\end{bmatrix}
\begin{bmatrix}K^T_1 \\ K^T_2 \\ \vdots \\ K^T_h\end{bmatrix}V\\
&= (Q_1K^T_1 + Q_2K^T_2 + \cdots + Q_hK^T_h)
\begin{bmatrix}V_1 & V_2 & \cdots & V_h\end{bmatrix}\\
&= \begin{bmatrix}
\begin{matrix} \textcolor{red}{Q_1 K^T_1 V_1}\\ + \\ Q_2 K^T_2 V_1 \\ + \\ \vdots \\ + \\ Q_h K^T_h V_1 \end{matrix}, &
\begin{matrix} Q_1 K^T_1 V_2 \\ + \\ \textcolor{red}{Q_2 K^T_2 V_2} \\ + \\ \vdots \\ + \\ Q_h K^T_h V_2 \end{matrix}, &
\cdots, &
\begin{matrix} Q_1 K^T_1 V_h \\ + \\ Q_2 K^T_2 V_h \\ + \\ \vdots \\ + \\ \textcolor{red}{Q_h K^T_h V_h} \end{matrix}
\end{bmatrix}
\end{split}
$$

$$
\text{Attention} \uparrow\downarrow \text{Multi-Head}
$$

$$
\begin{bmatrix}
\textcolor{red}{Q_1 K^T_1 V_1} & \textcolor{red}{Q_2 K^T_2 V_2} & \cdots & \textcolor{red}{Q_h K^T_h V_h}
\end{bmatrix}
$$
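A quick numerical check of the comparison above (my own sketch, ignoring the softmax and scaling just as the derivation does): $QK^T$ on the concatenated matrices equals the sum of the per-head products $Q_iK_i^T$, so single-head attention mixes all heads together, whereas multi-head attention keeps each $Q_iK_i^T$ separate.

```python
import torch

# Q and K are split column-wise into h heads of size d each.
n, h, d = 4, 8, 64
Q, K = torch.randn(n, h * d), torch.randn(n, h * d)
Qs, Ks = Q.split(d, dim=1), K.split(d, dim=1)        # Q_1..Q_h and K_1..K_h, each (n, d)

single = Q @ K.T                                     # single-head: all heads summed together
per_head = torch.stack([Qi @ Ki.T for Qi, Ki in zip(Qs, Ks)])  # multi-head keeps them apart, (h, n, n)

print(torch.allclose(single, per_head.sum(dim=0), atol=1e-4))  # True: QK^T = sum_i Q_i K_i^T
```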

Something about $\frac{1}{\sqrt{d_k}}$

Expectation and variance of the product of random variables

Assume that $q^1,\cdots,q^{d_k} \sim N(\mu_q, \sigma_q)$ and $k^1,\cdots,k^{d_k} \sim N(\mu_k, \sigma_k)$, where $\sigma_q$ and $\sigma_k$ denote variances and all the $q^i$, $k^i$ are mutually independent.

Actually, $\mu_q = \mu_k = 0$ and $\sigma_q = \sigma_k = 1$.

$$
\begin{split}
E(q\cdot k^T) &= E\left(\sum_{i=1}^{d_k}q^ik^i\right) \\
&= \sum_{i=1}^{d_k}E(q^ik^i)\\
&= \sum_{i=1}^{d_k} \left(Eq^i\cdot Ek^i + \mathrm{cov}(q^i, k^i) \right)\\
\xrightarrow[\mathrm{cov}(q^i,k^i)=0]{\text{independence}} &= \sum_{i=1}^{d_k} \left(Eq^i\cdot Ek^i + 0 \right) \\
&= d_k\cdot\mu_q\cdot\mu_k \xrightarrow{\mu=0} 0
\end{split}
$$

$$
\begin{split}
Var(q\cdot k^T) &= Var\left(\sum_{i=1}^{d_k}q^ik^i\right)\\
\xrightarrow{\text{independence}} &= \sum_{i=1}^{d_k} Var(q^ik^i)\\
&= \sum_{i=1}^{d_k} \left(Var(q^i)\cdot Var(k^i) + Var(q^i)\cdot E^2k^i + E^2q^i\cdot Var(k^i)\right)\\
&= \sum_{i=1}^{d_k} \left(\sigma_q\cdot \sigma_k + \sigma_q\cdot \mu_k^2 + \mu_q^2\cdot \sigma_k\right)\\
&= d_k \cdot (\sigma_q\cdot \sigma_k + \sigma_q\cdot \mu_k^2 + \mu_q^2\cdot \sigma_k) \xrightarrow[\sigma=1]{\mu=0} d_k
\end{split}
$$

From the equations above, you can infer that

$$
\begin{gathered}
E\left(\frac{q\cdot k^T}{\sqrt{d_k}}\right) = \frac{1}{\sqrt{d_k}}\cdot E(q\cdot k^T) = \sqrt{d_k} \cdot \mu_q \cdot \mu_k \xrightarrow{\mu=0} 0 \\
Var\left(\frac{q\cdot k^T}{\sqrt{d_k}}\right) = \frac{1}{d_k}\cdot Var(q\cdot k^T) = \sigma_q\cdot \sigma_k + \sigma_q\cdot \mu_k^2 + \mu_q^2\cdot \sigma_k \xrightarrow[\sigma=1]{\mu=0} 1
\end{gathered}
$$
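A quick numerical check of this scaling argument (my own sketch): with i.i.d. standard normal entries, the variance of $q\cdot k^T$ grows like $d_k$, and dividing by $\sqrt{d_k}$ brings it back to roughly 1, which keeps the softmax inputs well scaled.

```python
import torch

# Sample many independent (q, k) pairs with i.i.d. N(0, 1) entries and look at q·k^T.
d_k, trials = 64, 100_000
q = torch.randn(trials, d_k)
k = torch.randn(trials, d_k)

scores = (q * k).sum(dim=1)                # one dot product q·k^T per trial
print(scores.var().item())                 # ≈ d_k = 64
print((scores / d_k ** 0.5).var().item())  # ≈ 1 after dividing by sqrt(d_k)
```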

Holistic Perspective

B: Batch_size, T: Block_size (Time), C: Embedding_size (Channel)

```mermaid
graph LR
    Input["Input: [B,T,C]"]
    Q["Q: [B,T,dim_q]"] --> QK
    K["K: [B,T,dim_k]"] --> QK
    V["V: [B,T,dim_v]"]
    QK["Q·K^T: [B,T,T] (dim_q=dim_k)"]
    Input --"Wq: [C, dim_q]"--> Q
    Input --"Wk: [C, dim_k]"--> K
    Input --"Wv: [C, dim_v]"--> V
    Out["(Q·K^T)·V: [B,T,dim_v]"]
    QK --> Out
    V ----> Out
```
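To make the shapes in the diagram concrete, here is a minimal batched sketch (my own, with made-up dimensions) going from an input of shape [B,T,C] to an output of shape [B,T,dim_v]:

```python
import torch

B, T, C = 2, 10, 32          # batch, sequence length, embedding size
dim_q = dim_k = 16           # dim_q must equal dim_k
dim_v = 24
x = torch.randn(B, T, C)                                  # Input: [B, T, C]

Wq, Wk, Wv = torch.randn(C, dim_q), torch.randn(C, dim_k), torch.randn(C, dim_v)

Q, K, V = x @ Wq, x @ Wk, x @ Wv                          # [B,T,dim_q], [B,T,dim_k], [B,T,dim_v]
scores = Q @ K.transpose(1, 2) / dim_k ** 0.5             # Q·K^T: [B, T, T]
out = torch.softmax(scores, dim=-1) @ V                   # (Q·K^T)·V: [B, T, dim_v]
print(out.shape)                                          # torch.Size([2, 10, 24])
```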

Implementation in Python

Self-Attention

```python
import torch
from torch import nn

'''
x.shape: [n, dim_in]
Q = x @ Wq, K = x @ Wk, V = x @ Wv
attention = softmax((Q @ K^T)/sqrt(dim_k)) @ V
'''
class SelfAttention(nn.Module):
    def __init__(self, dim_in, dim_q, dim_k, dim_v):
        super(SelfAttention, self).__init__()
        assert dim_q == dim_k  # the dot product q·k^T requires dim_q == dim_k
        self.dim_in = dim_in
        self.dim_q = dim_q
        self.dim_k = dim_k
        self.dim_v = dim_v
        self.linear_q = nn.Linear(dim_in, dim_q, bias=False)  # Wq
        self.linear_k = nn.Linear(dim_in, dim_k, bias=False)  # Wk
        self.linear_v = nn.Linear(dim_in, dim_v, bias=False)  # Wv
        self.norm = dim_k ** (1 / 2)  # sqrt(d_k)

    def forward(self, x):
        '''x.shape: [n, dim_in]'''
        assert x.shape[-1] == self.dim_in
        q = self.linear_q(x)  # q = x @ Wq, shape (n, dim_q)
        k = self.linear_k(x)  # k = x @ Wk, shape (n, dim_k)
        v = self.linear_v(x)  # v = x @ Wv, shape (n, dim_v)

        attention = torch.mm(q, k.transpose(0, 1)) / self.norm  # (n, n)
        attention = nn.Softmax(dim=-1)(attention)                # row-wise softmax
        attention = torch.mm(attention, v)                       # (n, dim_v)

        return attention

if __name__ == "__main__":
    input = torch.rand(3, 16)
    attention = SelfAttention(dim_in=16, dim_q=8, dim_k=8, dim_v=16)
    output = attention(input)
    print(output.shape)  # torch.Size([3, 16])
```

MultiHead-Attention

```python
import torch
from torch import nn

'''
x.shape: [n, dim_in]
Q = x @ Wq, K = x @ Wk, V = x @ Wv
Q = [Q1,...,Qh], K = [K1,...,Kh], V = [V1,...,Vh]
concat[attention(Q1,K1,V1), ..., attention(Qh,Kh,Vh)]
'''
class MultiHeadSelfAttention(nn.Module):
    def __init__(self, dim_in, dim_q, dim_k, dim_v, num_heads=8):
        super(MultiHeadSelfAttention, self).__init__()
        assert dim_q == dim_k  # the dot product q·k^T requires dim_q == dim_k
        assert dim_k % num_heads == 0 and dim_v % num_heads == 0  # heads must divide evenly
        self.dim_in = dim_in
        self.dim_q = dim_q
        self.dim_k = dim_k
        self.dim_v = dim_v
        self.num_heads = num_heads
        self.linear_q = nn.Linear(dim_in, dim_q, bias=False)  # Wq
        self.linear_k = nn.Linear(dim_in, dim_k, bias=False)  # Wk
        self.linear_v = nn.Linear(dim_in, dim_v, bias=False)  # Wv
        self.norm_fact = (dim_k // num_heads) ** (1 / 2)  # sqrt(d_k per head)

    def forward(self, x):
        '''x.shape: [n, dim_in]'''
        n, dim_in = x.shape
        assert dim_in == self.dim_in
        dim_q = self.dim_q
        dim_k = self.dim_k
        dim_v = self.dim_v
        heads = self.num_heads

        # project, then split into heads: (n, dim) -> (heads, n, dim // heads)
        q = self.linear_q(x).reshape(n, heads, dim_q // heads).transpose(0, 1)
        k = self.linear_k(x).reshape(n, heads, dim_k // heads).transpose(0, 1)
        v = self.linear_v(x).reshape(n, heads, dim_v // heads).transpose(0, 1)

        attention = torch.matmul(q, k.transpose(1, 2)) / self.norm_fact  # (heads, n, n)
        attention = nn.Softmax(dim=-1)(attention)
        attention = torch.matmul(attention, v)  # (heads, n, dim_v // heads)
        # concat the heads: (heads, n, dim_v // heads) -> (n, dim_v)
        attention = attention.transpose(0, 1).reshape(n, dim_v)
        return attention

if __name__ == "__main__":
    input = torch.rand(3, 16)
    multihead = MultiHeadSelfAttention(dim_in=16, dim_q=8, dim_k=8, dim_v=16, num_heads=8)
    output = multihead(input)
    print(output.shape)  # torch.Size([3, 16])
```

MSA of Images

How to calculate the Multi-head Self Attention of an image, for example one with a shape of 3×224×224 pixels?

Patches
We first split the RGB image into non-overlapping patches. Each patch is treated as a “token” (a term from NLP meaning, roughly, one unit of the sequence), and its feature is set as the concatenation of its raw pixel RGB values.
For example, in the implementation of Swin-Transformer, they use a patch size of 4×4 and thus the feature dimension of each patch is 4 × 4 × 3 = 48.
The patch-splitting operation transforms an RGB image from shape 3×224×224 to 48×56×56: the image yields $\frac{224}{4}\times\frac{224}{4} = 56\times56$ patches, and each patch has $3\times4\times4 = 48$ pixel values. Finally, we transpose or reshape the 48×56×56 tensor to get a 56×56×48 one.

$$
\fbox{$\begin{array}{cccccccc} ·&·&·&·&·&·&·&·\\ ·&·&·&·&·&·&·&·\\ ·&·&·&·&·&·&·&·\\ ·&·&·&·&·&·&·&·\\ ·&·&·&·&·&·&·&·\\ ·&·&·&·&·&·&·&·\\ ·&·&·&·&·&·&·&·\\ ·&·&·&·&·&·&·&· \end{array}$}
\xrightarrow{\text{patch splitting}}
\begin{array}{c}
\fbox{$\begin{array}{cc}·&·\\·&·\end{array}$} \fbox{$\begin{array}{cc}·&·\\·&·\end{array}$} \fbox{$\begin{array}{cc}·&·\\·&·\end{array}$} \fbox{$\begin{array}{cc}·&·\\·&·\end{array}$}\\
\fbox{$\begin{array}{cc}·&·\\·&·\end{array}$} \fbox{$\begin{array}{cc}·&·\\·&·\end{array}$} \fbox{$\begin{array}{cc}·&·\\·&·\end{array}$} \fbox{$\begin{array}{cc}·&·\\·&·\end{array}$}\\
\fbox{$\begin{array}{cc}·&·\\·&·\end{array}$} \fbox{$\begin{array}{cc}·&·\\·&·\end{array}$} \fbox{$\begin{array}{cc}·&·\\·&·\end{array}$} \fbox{$\begin{array}{cc}·&·\\·&·\end{array}$}\\
\fbox{$\begin{array}{cc}·&·\\·&·\end{array}$} \fbox{$\begin{array}{cc}·&·\\·&·\end{array}$} \fbox{$\begin{array}{cc}·&·\\·&·\end{array}$} \fbox{$\begin{array}{cc}·&·\\·&·\end{array}$}
\end{array}
$$
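A minimal sketch of this patch-splitting step in PyTorch (assuming the 4×4 patch size used by Swin-Transformer; the reshape/permute pattern here is my own illustration, not taken from the official implementation):

```python
import torch

img = torch.randn(3, 224, 224)               # C, H, W
p = 4                                        # patch size (Swin-Transformer uses 4x4)

patches = img.reshape(3, 224 // p, p, 224 // p, p)   # (3, 56, 4, 56, 4)
patches = patches.permute(1, 3, 0, 2, 4)             # (56, 56, 3, 4, 4)
patches = patches.reshape(56, 56, 3 * p * p)         # (56, 56, 48): 56x56 tokens, 48 values each
print(patches.shape)                                 # torch.Size([56, 56, 48])
```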