The Intro
Attention Is All You Need | arxiv
Pytorch Transformers from Scratch (Attention is all you need) | Aladdin Persson | YouTube
TRANSFORMERS FROM SCRATCH | blog
transformer_from_scratch.py | GitHub
Attention and Q,K,V
Queries, keys, and values are terms borrowed from the field of information retrieval and recommendation systems.
There is a collection of key-value pairs $D = \{(k_1,v_1),(k_2,v_2),\cdots,(k_m,v_m)\}$ and a query $q$. You need to find the key-value pair $(k_?, v_?)$ that is best for your query.
$$
\begin{split}
\text{Attention}(q,D) &\overset{def}{=} \sum_{i=1}^m \alpha(q,k_i)v_i\\
&=
\begin{bmatrix}\alpha(q,k_1) & \cdots & \alpha(q,k_m)\end{bmatrix}
\begin{bmatrix}v_1 \\ \vdots \\ v_m\end{bmatrix}
\end{split}
$$
where $\alpha(q,k_i)\in\mathbb{R}$ are scalar attention weights. The operation itself is typically referred to as attention pooling. The name attention derives from the fact that the operation pays particular attention to the terms for which the weight $\alpha$ is significant (i.e., large).
As such, the attention over $D$ generates a linear combination of the values contained in the database.
We can apply a softmax operation to $[\alpha(q,k_1)\;\cdots\;\alpha(q,k_m)]$ to make the weights nonnegative and sum to 1:
$$
\text{Attention}(q,D) =
\text{softmax}\left(\begin{bmatrix}\alpha(q,k_1) & \cdots & \alpha(q,k_m)\end{bmatrix}\right)
\begin{bmatrix}v_1 \\ \vdots \\ v_m\end{bmatrix}
$$
In particular, when $q$, $k_i$, $v_i$ are all row vectors and the function $\alpha(\cdot,\cdot)$ is the vector dot product, we get dot-product attention, that is
$$
\begin{split}
\text{Attention}(q,D) &=
\text{softmax}\left(q\cdot [k^T_1,\cdots, k^T_m]\right)
\begin{bmatrix}v_1 \\ \vdots \\ v_m\end{bmatrix}\\
&= \text{softmax}(qK^T)V
\end{split}
$$
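To make the formula concrete, here is a minimal PyTorch sketch of this single-query dot-product attention (the toy shapes and random tensors are just for illustration):

```python
import torch

m, d = 4, 8                      # 4 key-value pairs, feature dimension 8
q = torch.randn(1, d)            # query, a row vector
K = torch.randn(m, d)            # keys stacked as rows
V = torch.randn(m, d)            # values stacked as rows

weights = torch.softmax(q @ K.T, dim=-1)   # [1, m] attention weights, sum to 1
out = weights @ V                          # [1, d] linear combination of values
print(weights.sum().item(), out.shape)     # 1.0, torch.Size([1, 8])
```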
Self-Attention
The “self” in “self-attention” means that there is no external collection of key-value pairs and no external query; instead, the queries, keys, and values all come from the input itself.
Input Embedding
$$
\{x_1, x_2, \cdots, x_n\} \xrightarrow[\text{Input Embedding}]{f(\cdot)} \{a_1, a_2, \cdots, a_n\}
$$
$x_1,\cdots,x_n$ is the original input sequence, and $a_1,\cdots,a_n$ is the sequence obtained by linearly embedding $x_1,\cdots,x_n$ into a higher-dimensional space. All elements are row vectors.
Q,K,V
$$
Q =
\underset{n\times d_q}{
\begin{bmatrix}
q_1 \\ q_2 \\ \vdots \\ q_n
\end{bmatrix}
} =
\underset{n\times d_a}{
\begin{bmatrix}
a_1 \\ a_2 \\ \vdots \\ a_n
\end{bmatrix}
} \cdot
\underset{d_a\times d_q}{W^q},
\quad
K =
\underset{n\times d_k}{
\begin{bmatrix}
k_1 \\ k_2 \\ \vdots \\ k_n
\end{bmatrix}
} =
\underset{n\times d_a}{
\begin{bmatrix}
a_1 \\ a_2 \\ \vdots \\ a_n
\end{bmatrix}
} \cdot
\underset{d_a\times d_k}{W^k},
\quad
V =
\underset{n\times d_v}{
\begin{bmatrix}
v_1 \\ v_2 \\ \vdots \\ v_n
\end{bmatrix}
} =
\underset{n\times d_a}{
\begin{bmatrix}
a_1 \\ a_2 \\ \vdots \\ a_n
\end{bmatrix}
} \cdot
\underset{d_a\times d_v}{W^v}
$$
$$
d_q = d_k \overset{?}{=} d_v
$$
Note that $d_q = d_k$ is required so that the dot products $q_i k_j^T$ are defined, while $d_v$ may be chosen independently.
Attention
$$
\text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$$
Note: $d_k$ is the dimension of each key vector $k_1,\cdots,k_n$, not the shape of $K$, which is $(n\times d_k)$.
$$
QK^T =
\begin{bmatrix}
q_1 \\ q_2 \\ \vdots \\ q_n
\end{bmatrix}
\cdot
\begin{bmatrix}
k_1^T & k_2^T & \cdots & k_n^T
\end{bmatrix} =
\begin{bmatrix}
q_1k_1^T & q_1k_2^T & \cdots & q_1k_n^T\\
q_2k_1^T & q_2k_2^T & \cdots & q_2k_n^T\\
\vdots & \vdots & \ddots & \vdots\\
q_nk_1^T & q_nk_2^T & \cdots & q_nk_n^T\\
\end{bmatrix}
$$
$$
\text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) =
\begin{bmatrix}
\text{softmax}(\frac{q_1k_1^T}{\sqrt{d_k}} & \frac{q_1k_2^T}{\sqrt{d_k}} & \cdots & \frac{q_1k_n^T}{\sqrt{d_k}})\\
\text{softmax}(\frac{q_2k_1^T}{\sqrt{d_k}} & \frac{q_2k_2^T}{\sqrt{d_k}} & \cdots & \frac{q_2k_n^T}{\sqrt{d_k}})\\
\vdots & \vdots & & \vdots\\
\text{softmax}(\frac{q_nk_1^T}{\sqrt{d_k}} & \frac{q_nk_2^T}{\sqrt{d_k}} & \cdots & \frac{q_nk_n^T}{\sqrt{d_k}})\\
\end{bmatrix}
$$
i.e., the softmax is applied to each row independently.
$$
\text{Attention}(Q,K,V) =
\text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V
$$
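This matrix form maps directly to a few lines of PyTorch; a minimal sketch (the function name and shapes below are illustrative, the full module implementations come later):

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V):
    """Q: [n, d_k], K: [n, d_k], V: [n, d_v] -> [n, d_v]"""
    d_k = K.shape[-1]
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # [n, n]
    weights = torch.softmax(scores, dim=-1)            # row-wise softmax
    return weights @ V

n, d_k, d_v = 5, 8, 16
Q, K, V = torch.randn(n, d_k), torch.randn(n, d_k), torch.randn(n, d_v)
print(scaled_dot_product_attention(Q, K, V).shape)     # torch.Size([5, 16])
```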
Multi-head Self Attention
$$
\underset{n\times d_{model}}{Q} =
\underset{n\times d_{model}}{
\begin{bmatrix}
q_1 \\ q_2 \\ \vdots \\ q_n
\end{bmatrix}
} =
\underset{n\times d_a}{
\begin{bmatrix}
a_1 \\ a_2 \\ \vdots \\ a_n
\end{bmatrix}
} \cdot
\underset{d_a\times d_{model}}{W^q}\\
\underset{n\times d_{model}}{K} =
\underset{n\times d_{model}}{
\begin{bmatrix}
k_1 \\ k_2 \\ \vdots \\ k_n
\end{bmatrix}
} =
\underset{n\times d_a}{
\begin{bmatrix}
a_1 \\ a_2 \\ \vdots \\ a_n
\end{bmatrix}
} \cdot
\underset{d_a\times d_{model}}{W^k}\\
\underset{n\times d_{model}}{V} =
\underset{n\times d_{model}}{
\begin{bmatrix}
v_1 \\ v_2 \\ \vdots \\ v_n
\end{bmatrix}
} =
\underset{n\times d_a}{
\begin{bmatrix}
a_1 \\ a_2 \\ \vdots \\ a_n
\end{bmatrix}
} \cdot
\underset{d_a\times d_{model}}{W^v}
$$
Let $d_q=d_k=d_v=d_{model}/h$.
$$
\text{MultiHead}(Q,K,V) = \text{Concat}(head_1,\cdots,head_h)W^O\\
head_i = \text{Attention}(QW^Q_i, KW^K_i, VW^V_i)
$$
In the paper Attention Is All You Need, the authors employed $d_{model} = 512$, $h = 8$, and $d_q = d_k = d_v = d_{model}/h = 64$.
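A minimal sketch of this formula with the paper's sizes (random tensors stand in for trained weights; only the shapes matter here):

```python
import torch

n, d_model, h = 10, 512, 8
d_k = d_v = d_model // h                                # 64

Q = K = V = torch.randn(n, d_model)                     # self-attention: same source
W_Q = [torch.randn(d_model, d_k) for _ in range(h)]     # per-head projection matrices
W_K = [torch.randn(d_model, d_k) for _ in range(h)]
W_V = [torch.randn(d_model, d_v) for _ in range(h)]
W_O = torch.randn(h * d_v, d_model)

def attention(q, k, v):                                 # scaled dot-product attention
    return torch.softmax(q @ k.T / d_k ** 0.5, dim=-1) @ v

heads = [attention(Q @ W_Q[i], K @ W_K[i], V @ W_V[i]) for i in range(h)]
out = torch.cat(heads, dim=1) @ W_O                     # Concat(head_1..head_h) W^O
print(out.shape)                                        # torch.Size([10, 512])
```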
$$
\begin{split}
Q &\rightarrow QW^Q_1,QW^Q_2,\cdots,QW^Q_h\\
K &\rightarrow KW^K_1,KW^K_2,\cdots,KW^K_h\\
V &\rightarrow VW^V_1,VW^V_2,\cdots,VW^V_h
\end{split}
$$
$W^Q_1 \cdots W^Q_8$, $W^K_1 \cdots W^K_8$ and $W^V_1 \cdots W^V_8$ are matrices of the following form:
$$
\begin{array}{c}
\begin{matrix}
1 \\ 2 \\ \vdots \\ 64 \\ \\ \vdots \\ 512
\end{matrix}
\begin{bmatrix}
\textcolor{red}{1} & 0 & \cdots & 0\\
0 & \textcolor{red}{1} & \cdots & 0\\
\vdots & \vdots & \ddots & \vdots\\
0 & 0 & \cdots & \textcolor{red}{1}\\
0 & 0 & \cdots & 0\\
\vdots & \vdots && \vdots\\
0 & 0 & \cdots & 0
\end{bmatrix}
\end{array},
\quad
\begin{array}{c}
\begin{matrix}
1 \\ \vdots \\ \\ 65 \\ 66 \\ \vdots \\ 128 \\ \\ \vdots \\ 512
\end{matrix}
\begin{bmatrix}
0 & 0 & \cdots & 0\\
\vdots & \vdots && \vdots\\
0 & 0 & \cdots & 0\\
\textcolor{red}{1} & 0 & \cdots & 0\\
0 & \textcolor{red}{1} & \cdots & 0\\
\vdots & \vdots & \ddots & \vdots\\
0 & 0 & \cdots & \textcolor{red}{1}\\
0 & 0 & \cdots & 0\\
\vdots & \vdots && \vdots\\
0 & 0 & \cdots & 0
\end{bmatrix}
\end{array},
\quad \cdots, \quad
\begin{array}{c}
\begin{matrix}
1 \\ \vdots \\ \\ 449 \\ 450 \\ \vdots \\ 512
\end{matrix}
\begin{bmatrix}
0 & 0 & \cdots & 0\\
\vdots & \vdots && \vdots\\
0 & 0 & \cdots & 0\\
\textcolor{red}{1} & 0 & \cdots & 0\\
0 & \textcolor{red}{1} & \cdots & 0\\
\vdots & \vdots & \ddots & \vdots\\
0 & 0 & \cdots & \textcolor{red}{1}
\end{bmatrix}
\end{array}
$$
In this way, $\underset{n\times 512}{Q}$, $\underset{n\times 512}{K}$, $\underset{n\times 512}{V}$ are split uniformly along their columns into 8 parts.
$$
\begin{split}
\underset{n\times 512}{Q} \xrightarrow{W_1\cdots W_8} \underset{n\times 64}{Q_1},\cdots, \underset{n\times 64}{Q_8}\\
\underset{n\times 512}{K} \xrightarrow{W_1\cdots W_8} \underset{n\times 64}{K_1},\cdots, \underset{n\times 64}{K_8}\\
\underset{n\times 512}{V} \xrightarrow{W_1\cdots W_8} \underset{n\times 64}{V_1},\cdots, \underset{n\times 64}{V_8}
\end{split}
$$
In particular, we have
$$
\begin{split}
\underset{n\times 512}{Q} = \text{concat}[\underset{n\times 64}{Q_1}\cdots \underset{n\times 64}{Q_8}]\\
\underset{n\times 512}{K} = \text{concat}[\underset{n\times 64}{K_1}\cdots \underset{n\times 64}{K_8}]\\
\underset{n\times 512}{V} = \text{concat}[\underset{n\times 64}{V_1}\cdots \underset{n\times 64}{V_8}]
\end{split}
$$
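With these slicing matrices, multiplying by $W_1$ is exactly the same as taking the first 64 columns; a quick sanity check (a sketch, not code from the paper):

```python
import torch

n, d_model, h = 10, 512, 8
d_head = d_model // h                       # 64
Q = torch.randn(n, d_model)

# W_1: identity block in rows 1..64, zeros elsewhere (the first slicing matrix)
W_1 = torch.zeros(d_model, d_head)
W_1[:d_head, :] = torch.eye(d_head)

Q_1 = Q @ W_1                               # projection with the slicing matrix
assert torch.allclose(Q_1, Q[:, :d_head])   # identical to column slicing
print(Q_1.shape)                            # torch.Size([10, 64])
```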
What’s the difference between Multi-head and normal Attention
In both cases we obtain $Q,K,V$ from the input sequence $x_1,\cdots,x_n$ by the same linear embedding. For multi-head attention we then split $Q,K,V$ into several parts to get $Q_1\cdots Q_h$, $K_1\cdots K_h$, $V_1\cdots V_h$.
In particular, when we use special parameter matrices (such as the slicing matrices above), we get the special case:
$$
\begin{split}
Q &= \text{concat}[Q_1\cdots Q_h]\\
K &= \text{concat}[K_1\cdots K_h]\\
V &= \text{concat}[V_1\cdots V_h]
\end{split}
$$
How to calculate normal Attention
$$
\text{Attention}(Q,K,V) =
\text{Attention}(\text{concat}[Q_1\cdots Q_h],\text{concat}[K_1\cdots K_h],\text{concat}[V_1\cdots V_h])
$$
How to calculate multi-head Attention
$$
\text{MultiHead}(Q,K,V) = \text{concat}[\text{Attention}(Q_1,K_1,V_1),\cdots, \text{Attention}(Q_h,K_h,V_h)]
$$
Comparison of the computation
$$
\begin{split}
Q &= \text{concat}[Q_1\cdots Q_h]\\
K &= \text{concat}[K_1\cdots K_h]\\
V &= \text{concat}[V_1\cdots V_h]
\end{split}
$$
$$
\begin{split}
QK^TV &=
\begin{bmatrix}Q_1 & Q_2 & \cdots & Q_h\end{bmatrix}
\begin{bmatrix}K^T_1 \\ K^T_2 \\ \vdots \\ K^T_h\end{bmatrix}V\\
&= (Q_1K^T_1 + Q_2K^T_2 + \cdots + Q_hK^T_h)
\begin{bmatrix}V_1 & V_2 & \cdots & V_h\end{bmatrix}\\
&=
\begin{bmatrix}
\begin{matrix}
\textcolor{red}{Q_1 K^T_1 V_1}\\ + \\ Q_2 K^T_2 V_1 \\ + \\ \vdots \\ + \\ Q_h K^T_h V_1
\end{matrix},&
\begin{matrix}
Q_1 K^T_1 V_2 \\ + \\ \textcolor{red}{Q_2 K^T_2 V_2} \\ + \\ \vdots \\ + \\ Q_h K^T_h V_2
\end{matrix},
& \cdots, &
\begin{matrix}
Q_1 K^T_1 V_h \\ + \\ Q_2 K^T_2 V_h \\ + \\ \vdots \\ + \\ \textcolor{red}{Q_h K^T_h V_h}
\end{matrix}
\end{bmatrix}
\end{split}
$$
(The softmax is omitted here so that the block structure stays visible.)
$$
\text{Attention} \uparrow\downarrow \text{Multi-Head}
$$
$$
\begin{bmatrix}
\textcolor{red}{Q_1 K^T_1 V_1}&
\textcolor{red}{Q_2 K^T_2 V_2}&
\cdots&
\textcolor{red}{Q_h K^T_h V_h}
\end{bmatrix}
$$
Compared with normal attention, multi-head attention keeps only the "diagonal" terms $Q_i K^T_i V_i$ (in red) and drops the cross-head terms $Q_i K^T_i V_j$ ($i\neq j$).
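A quick numerical check of this block structure (a sketch with the softmax omitted, as in the derivation above; the sizes are placeholders):

```python
import torch

n, d_model, h = 6, 512, 8
d_head = d_model // h
Q, K, V = (torch.randn(n, d_model) for _ in range(3))

Qs = Q.split(d_head, dim=1)              # Q_1, ..., Q_h, each [n, 64]
Ks = K.split(d_head, dim=1)
Vs = V.split(d_head, dim=1)

# "normal" attention logits: QK^T equals the sum of the per-head logits
logits_full = Q @ K.T
logits_sum = sum(q_i @ k_i.T for q_i, k_i in zip(Qs, Ks))
print(torch.allclose(logits_full, logits_sum, atol=1e-3))   # True

# multi-head (without softmax): only the diagonal terms Q_i K_i^T V_i survive
multihead = torch.cat([(q_i @ k_i.T) @ v_i for q_i, k_i, v_i in zip(Qs, Ks, Vs)], dim=1)
print(multihead.shape)                   # torch.Size([6, 512])
```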
Something about $\frac{1}{\sqrt{d_k}}$
Expectation and variance of the product of random variables
Assume that $q^1,\cdots,q^{d_k} \sim N(\mu_q, \sigma_q)$ and $k^1,\cdots,k^{d_k} \sim N(\mu_k, \sigma_k)$, where $q^i$ and $k^i$ denote the components of $q$ and $k$, and $\sigma$ denotes the variance.
In practice, $\mu_q = \mu_k = 0$ and $\sigma_q = \sigma_k = 1$.
$$
\begin{split}
E(q\cdot k^T) &= E\left(\sum_{i=1}^{d_k}q^ik^i\right) \\
&= \sum_{i=1}^{d_k}E(q^ik^i)\\
&= \sum_{i=1}^{d_k} \left(Eq^i\cdot Ek^i + cov(q^i, k^i) \right)\\
\xrightarrow[cov(q^i,k^i)=0]{independence} &= \sum_{i=1}^{d_k} \left(Eq^i\cdot Ek^i + 0 \right) \\
&= d_k\cdot\mu_q\cdot\mu_k
\xrightarrow{\mu=0} 0
\end{split}
$$
$$
\begin{split}
Var(q\cdot k^T) &= Var\left(\sum_{i=1}^{d_k}q^ik^i\right)\\
\xrightarrow{independence} &= \sum_{i=1}^{d_k} Var(q^ik^i)\\
&= \sum_{i=1}^{d_k} \left(Var(q^i)\cdot Var(k^i) + Var(q^i)\cdot E^2k^i + E^2q^i\cdot Var(k^i)\right)\\
&= \sum_{i=1}^{d_k} \left(\sigma_q\cdot \sigma_k + \sigma_q\cdot \mu_k^2 + \mu_q^2\cdot \sigma_k\right)\\
&= d_k \cdot (\sigma_q\cdot \sigma_k + \sigma_q\cdot \mu_k^2 + \mu_q^2\cdot \sigma_k)
\xrightarrow[\sigma=1]{\mu=0} d_k
\end{split}
$$
You can infer from the equations above
$$
E\left(\frac{q\cdot k^T}{\sqrt{d_k}}\right) = \frac{1}{\sqrt{d_k}}\cdot E(q\cdot k^T) = \sqrt{d_k} \cdot \mu_q \cdot \mu_k \xrightarrow{\mu=0} 0
\\
Var\left(\frac{q\cdot k^T}{\sqrt{d_k}}\right) = \frac{1}{d_k}\cdot Var(q\cdot k^T) = \sigma_q\cdot \sigma_k + \sigma_q\cdot \mu_k^2 + \mu_q^2\cdot \sigma_k \xrightarrow[\sigma=1]{\mu=0} 1
$$
Scaling by $\frac{1}{\sqrt{d_k}}$ therefore keeps the variance of the attention logits at 1 regardless of $d_k$, which prevents the softmax from saturating when $d_k$ is large.
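An empirical check of these two statements (a sketch; the sample size is arbitrary):

```python
import torch

d_k, num_samples = 64, 100_000
q = torch.randn(num_samples, d_k)          # components ~ N(0, 1)
k = torch.randn(num_samples, d_k)

scores = (q * k).sum(dim=-1)               # q · k^T for each sample
print(scores.var().item())                 # ≈ d_k = 64
print((scores / d_k ** 0.5).var().item())  # ≈ 1 after scaling by 1/sqrt(d_k)
```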
Holistic Perspective
B: Batch_size, T: Block_size (Time), C: Embedding_size (Channel)
```mermaid
graph LR
    Input["Input: [B,T,C]"]
    Q["Q: [B,T,dim_q]"] --> QK
    K["K: [B,T,dim_k]"] --> QK
    V["V: [B,T,dim_v]"]
    QK["Q·K^T: [B,T,T] (dim_q=dim_k)"]
    Input --"Wq: [C, dim_q]"--> Q
    Input --"Wk: [C, dim_k]"--> K
    Input --"Wv: [C, dim_v]"--> V
    Out["(Q·K^T)·V: [B,T,dim_v]"]
    QK --> Out
    V ----> Out
```
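A shape-only sketch of this batched flow (the names B, T, C, Wq, Wk, Wv mirror the diagram; the sizes are placeholders, and the full module implementations follow in the next section):

```python
import torch
from torch import nn

B, T, C = 2, 10, 32                       # batch, time (block size), channels
dim_q = dim_k = 16
dim_v = 24

x = torch.rand(B, T, C)                   # Input: [B, T, C]
Wq = nn.Linear(C, dim_q, bias=False)
Wk = nn.Linear(C, dim_k, bias=False)
Wv = nn.Linear(C, dim_v, bias=False)

Q, K, V = Wq(x), Wk(x), Wv(x)             # [B, T, dim_q], [B, T, dim_k], [B, T, dim_v]
scores = Q @ K.transpose(-2, -1) / dim_k ** 0.5   # [B, T, T]
out = torch.softmax(scores, dim=-1) @ V   # [B, T, dim_v]
print(out.shape)                          # torch.Size([2, 10, 24])
```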
Implementation in Python
Self-Attention
```python
import torch
from torch import nn

'''
x.shape: [n, dim_in]
Q = x @ Wq, K = x @ Wk, V = x @ Wv
attention = softmax((Q @ K^T)/sqrt(dim_k)) @ V
'''

class SelfAttention(nn.Module):
    def __init__(self, dim_in, dim_q, dim_k, dim_v):
        super(SelfAttention, self).__init__()
        assert dim_k == dim_q
        self.dim_in = dim_in
        self.dim_q = dim_q
        self.dim_k = dim_k
        self.dim_v = dim_v
        self.linear_q = nn.Linear(dim_in, dim_q, bias=False)
        self.linear_k = nn.Linear(dim_in, dim_k, bias=False)
        self.linear_v = nn.Linear(dim_in, dim_v, bias=False)
        self.norm = dim_k ** (1 / 2)

    def forward(self, x):
        '''x: [n, dim_in]'''
        assert x.shape[-1] == self.dim_in
        q = self.linear_q(x)  # [n, dim_q]
        k = self.linear_k(x)  # [n, dim_k]
        v = self.linear_v(x)  # [n, dim_v]
        attention = torch.mm(q, k.transpose(0, 1)) / self.norm  # [n, n]
        attention = nn.Softmax(dim=-1)(attention)               # row-wise softmax
        attention = torch.mm(attention, v)                      # [n, dim_v]
        return attention


if __name__ == "__main__":
    input = torch.rand(3, 16)
    attention = SelfAttention(dim_in=16, dim_q=8, dim_k=8, dim_v=16)
    output = attention.forward(input)
    print(output.shape)  # torch.Size([3, 16])
```
MultiHead-Attention
```python
import torch
from torch import nn

'''
x.shape: [n, dim_in]
Q = x @ Wq, K = x @ Wk, V = x @ Wv
Q = [Q1,..,Qh], K = [K1,...,Kh], V = [V1,...,Vh]
concat[attention(Q1,K1,V1), ..., attention(Qh,Kh,Vh)]
'''

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, dim_in, dim_q, dim_k, dim_v, num_heads=8):
        super(MultiHeadSelfAttention, self).__init__()
        assert dim_q == dim_k
        self.dim_in = dim_in
        self.dim_q = dim_q
        self.dim_k = dim_k
        self.dim_v = dim_v
        self.num_heads = num_heads
        self.linear_q = nn.Linear(dim_in, dim_q, bias=False)
        self.linear_k = nn.Linear(dim_in, dim_k, bias=False)
        self.linear_v = nn.Linear(dim_in, dim_v, bias=False)
        self.norm_fact = (dim_k // num_heads) ** (1 / 2)

    def forward(self, x):
        '''x.shape: [n, dim_in]'''
        n, dim_in = x.shape
        assert dim_in == self.dim_in
        dim_q, dim_k, dim_v = self.dim_q, self.dim_k, self.dim_v
        heads = self.num_heads

        # split Q, K, V column-wise into `heads` parts: [heads, n, dim_*//heads]
        q = self.linear_q(x).reshape(n, heads, dim_q // heads).transpose(0, 1)
        k = self.linear_k(x).reshape(n, heads, dim_k // heads).transpose(0, 1)
        v = self.linear_v(x).reshape(n, heads, dim_v // heads).transpose(0, 1)

        attention = torch.matmul(q, k.transpose(1, 2)) / self.norm_fact  # [heads, n, n]
        attention = nn.Softmax(dim=-1)(attention)
        attention = torch.matmul(attention, v)                   # [heads, n, dim_v//heads]
        attention = attention.transpose(0, 1).reshape(n, dim_v)  # concat heads -> [n, dim_v]
        return attention


if __name__ == "__main__":
    input = torch.rand(3, 16)
    multihead = MultiHeadSelfAttention(dim_in=16, dim_q=8, dim_k=8, dim_v=16, num_heads=8)
    output = multihead.forward(input)
    print(output.shape)  # torch.Size([3, 16])
```
MSA of Images
How do we calculate the multi-head self-attention of an image, for example one with a shape of 3×224×224 pixels?
Patches
We first split the RGB image into non-overlapping patches. Each patch is treated as a "token" (a term from NLP meaning roughly a basic unit), and its feature is the concatenation of the raw pixel RGB values.
For example, in the implementation of Swin-Transformer, the authors use a patch size of 4×4, so the feature dimension of each patch is 4 × 4 × 3 = 48.
The patch-splitting operation transforms an RGB image from shape 3×224×224 to 48×56×56: there are $\frac{224}{4}\times\frac{224}{4} = 56\times56$ patches, and each patch carries $3\times4\times4 = 48$ pixel values. Finally, we transpose or reshape the 48×56×56 tensor to get a 56×56×48 tensor, i.e., 56×56 tokens with 48-dimensional features (see the code sketch after the diagram below).
$$
\fbox{$\begin{array}{cccccccc}
·&·&·&·&·&·&·&·\\
·&·&·&·&·&·&·&·\\
·&·&·&·&·&·&·&·\\
·&·&·&·&·&·&·&·\\
·&·&·&·&·&·&·&·\\
·&·&·&·&·&·&·&·\\
·&·&·&·&·&·&·&·\\
·&·&·&·&·&·&·&·
\end{array}$}
\xrightarrow{\text{patch splitting}}
\begin{array}{cccc}
\fbox{$\begin{array}{cc}·&·\\·&·\end{array}$} &
\fbox{$\begin{array}{cc}·&·\\·&·\end{array}$} &
\fbox{$\begin{array}{cc}·&·\\·&·\end{array}$} &
\fbox{$\begin{array}{cc}·&·\\·&·\end{array}$}\\
\fbox{$\begin{array}{cc}·&·\\·&·\end{array}$} &
\fbox{$\begin{array}{cc}·&·\\·&·\end{array}$} &
\fbox{$\begin{array}{cc}·&·\\·&·\end{array}$} &
\fbox{$\begin{array}{cc}·&·\\·&·\end{array}$}\\
\fbox{$\begin{array}{cc}·&·\\·&·\end{array}$} &
\fbox{$\begin{array}{cc}·&·\\·&·\end{array}$} &
\fbox{$\begin{array}{cc}·&·\\·&·\end{array}$} &
\fbox{$\begin{array}{cc}·&·\\·&·\end{array}$}\\
\fbox{$\begin{array}{cc}·&·\\·&·\end{array}$} &
\fbox{$\begin{array}{cc}·&·\\·&·\end{array}$} &
\fbox{$\begin{array}{cc}·&·\\·&·\end{array}$} &
\fbox{$\begin{array}{cc}·&·\\·&·\end{array}$}
\end{array}
$$
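A minimal patch-splitting sketch with reshape/permute, assuming the 4×4 patch size used in Swin-Transformer (the variable names are illustrative, and the ordering of the 48 values within each token may differ from the actual Swin implementation):

```python
import torch

C, H, W, P = 3, 224, 224, 4               # channels, height, width, patch size
img = torch.rand(C, H, W)                 # a 3×224×224 RGB image

# [C, H, W] -> [C, H/P, P, W/P, P] -> [H/P, W/P, C, P, P] -> [H/P, W/P, C*P*P]
patches = (img.reshape(C, H // P, P, W // P, P)
              .permute(1, 3, 0, 2, 4)
              .reshape(H // P, W // P, C * P * P))
print(patches.shape)                      # torch.Size([56, 56, 48])

tokens = patches.reshape(-1, C * P * P)   # [3136, 48]: 56*56 tokens of dimension 48
print(tokens.shape)
```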