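Understanding GPT in Depth

The Structure of GPT

First we need to understand the structure of GPT. The figure below shows the structure of GPT-1: the blue boxed part in the middle corresponds to n Transformer Blocks, the left side is the input and the Embedding layer, and the right side is the output prediction and the Loss.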

[Figure: gpt-2]

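About the Input

The raw input is a matrix of size [Batch_size, Time_step]. For example, in character-level generation (prediction), the input "To be or not to be", counting spaces, is an input with Batch_size = 1 and Time_step = 18.

How to Encode the Input Characters

Take the tinyshakespeare text as an example: it contains 65 distinct characters, which we sort by their ASCII code values.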

with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()
chars = sorted(list(set(text))) # len = 65
# here are all the unique characters that occur in this text
'''
['\n', ' ', '!', '$', '&', "'", ',', '-', '.', '3', ':', ';', '?',
 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M',
 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z',
 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm',
 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
'''

Using the index in this sorted order (starting from 0), we encode these 65 characters:

stoi = { ch:i for i,ch in enumerate(chars) }
# encoder: take a string, output a list of integers
encode = lambda s: [stoi[c] for c in s]
'''
{'\n': 0, ' ': 1, '!': 2, '$': 3, '&': 4, "'": 5, ',': 6, '-': 7, '.': 8, '3': 9, ':': 10, ';': 11, '?': 12,
'A': 13, 'B': 14, 'C': 15, 'D': 16, 'E': 17, 'F': 18, 'G': 19, 'H': 20, 'I': 21, 'J': 22, 'K': 23, 'L': 24, 'M': 25,
'N': 26, 'O': 27, 'P': 28, 'Q': 29, 'R': 30, 'S': 31, 'T': 32, 'U': 33, 'V': 34, 'W': 35, 'X': 36, 'Y': 37, 'Z': 38,
'a': 39, 'b': 40, 'c': 41, 'd': 42, 'e': 43, 'f': 44, 'g': 45, 'h': 46, 'i': 47, 'j': 48, 'k': 49, 'l': 50, 'm': 51,
'n': 52, 'o': 53, 'p': 54, 'q': 55, 'r': 56, 's': 57, 't': 58, 'u': 59, 'v': 60, 'w': 61, 'x': 62, 'y': 63, 'z': 64}
'''

itos = { i:ch for i,ch in enumerate(chars) }
# decoder: take a list of integers, output a string
decode = lambda l: ''.join([itos[i] for i in l])
'''
{0: '\n', 1: ' ', 2: '!', 3: '$', 4: '&', 5: "'", 6: ',', 7: '-', 8: '.', 9: '3', 10: ':', 11: ';', 12: '?',
13: 'A', 14: 'B', 15: 'C', 16: 'D', 17: 'E', 18: 'F', 19: 'G', 20: 'H', 21: 'I', 22: 'J', 23: 'K', 24: 'L', 25: 'M',
26: 'N', 27: 'O', 28: 'P', 29: 'Q', 30: 'R', 31: 'S', 32: 'T', 33: 'U', 34: 'V', 35: 'W', 36: 'X', 37: 'Y', 38: 'Z',
39: 'a', 40: 'b', 41: 'c', 42: 'd', 43: 'e', 44: 'f', 45: 'g', 46: 'h', 47: 'i', 48: 'j', 49: 'k', 50: 'l', 51: 'm',
52: 'n', 53: 'o', 54: 'p', 55: 'q', 56: 'r', 57: 's', 58: 't', 59: 'u', 60: 'v', 61: 'w', 62: 'x', 63: 'y', 64: 'z'}
'''

For example, encoding the string input "To be or not to be" gives a list of ints, and decoding that list gives back the original string.

str_input = "To be or not to be"
int_input = encode(str_input)
'''[32, 53, 1, 40, 43, 1, 53, 56, 1, 52, 53, 58, 1, 58, 53, 1, 40, 43]'''
str_output = decode(int_input)
'''To be or not to be'''
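
As a quick self-contained check of the encode/decode round trip, here is a minimal sketch that builds the vocabulary from a short inline string instead of `input.txt`. The sample text and the name `sample_text` are assumptions for illustration, so the resulting indices differ from the 65-character tinyshakespeare table above.

```python
# Minimal sketch: character-level vocabulary built from an inline sample.
# `sample_text` is a stand-in for the real corpus (assumption for illustration).
sample_text = "To be or not to be, that is the question."

chars = sorted(set(sample_text))              # unique characters, sorted by code point
stoi = {ch: i for i, ch in enumerate(chars)}  # char -> int
itos = {i: ch for i, ch in enumerate(chars)}  # int -> char

def encode(s):
    """Map a string to a list of integer ids."""
    return [stoi[c] for c in s]

def decode(ids):
    """Map a list of integer ids back to a string."""
    return "".join(itos[i] for i in ids)

str_input = "To be or not to be"
int_input = encode(str_input)

# One sequence of 18 characters gives an input of shape [Batch_size=1, Time_step=18].
batch = [int_input]
print(len(batch), len(batch[0]))
print(decode(int_input) == str_input)
```

Any character that does not occur in the text used to build `stoi` would raise a `KeyError` in `encode`, which is one practical limitation of a vocabulary built this way.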

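About the Embedding Layer

For an integer input of shape [Batch_size, Time_step], we apply an embedding operation that maps each scalar integer to a high-dimensional vector. The mapping is done by nn.Embedding; below is a simple demo of what nn.Embedding does.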

import torch
from torch import nn

# An input of batch_size = 2 and time_step = 3, whose elements are integers ranging from 0 to 4
int_input = torch.tensor([[0, 3, 1], [2, 0, 4]], dtype=torch.int)

# initialize an embedding table
embedding_table = nn.Embedding(5, 4) # 0, 1, 2, 3, 4
# display its parameter, which is a matrix of shape 5 x 4
for parameter in embedding_table.parameters():
    print(parameter.shape)
    print(parameter)
'''
torch.Size([5, 4])
Parameter containing:
tensor([[ 0.4282, -2.1105,  0.0480, -0.3238],  # --> 0
        [-1.0883, -1.3166, -0.7612, -0.1886],  # --> 1
        [-0.3456,  1.0513,  2.5954,  0.0092],  # --> 2
        [ 0.9587,  0.5971, -0.1690,  1.7883],  # --> 3
        [-1.9284,  1.5788,  0.5987, -2.2758]], # --> 4
        requires_grad=True)
'''

# embedding the inputs from 1-dim integers to 4-dim floats
tok_embed = embedding_table(int_input)
print(tok_embed)
'''
tensor([[[ 0.4282, -2.1105,  0.0480, -0.3238],  # <-- 0
         [ 0.9587,  0.5971, -0.1690,  1.7883],  # <-- 3
         [-1.0883, -1.3166, -0.7612, -0.1886]], # <-- 1

        [[-0.3456,  1.0513,  2.5954,  0.0092],  # <-- 2
         [ 0.4282, -2.1105,  0.0480, -0.3238],  # <-- 0
         [-1.9284,  1.5788,  0.5987, -2.2758]]],# <-- 4
         grad_fn=<EmbeddingBackward0>)
'''

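In the demo above, the input takes 5 distinct values 0, 1, 2, 3, 4. Each value is assigned its own 4-dimensional vector, and together these form an embedding_table of shape 5 × 4. When the input [[0, 3, 1], [2, 0, 4]] passes through the embedding layer, 0 is mapped to embedding_table[0] (a 4-dimensional vector), 3 is mapped to embedding_table[3], and so on. The original 2 × 3 input thus becomes a 2 × 3 × 4 tensor.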
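
To make the "lookup" reading concrete, the following is a minimal pure-Python sketch (not how nn.Embedding is implemented internally). It reuses the table values printed in the demo and shows that an embedding lookup is row indexing, which gives the same result as multiplying a one-hot vector by the table. The helper names `embed` and `one_hot_matmul` are hypothetical.

```python
# Minimal sketch: an embedding lookup is row indexing into a table,
# equivalent to one_hot(id) @ table. Values copied from the demo output above.
embedding_table = [
    [ 0.4282, -2.1105,  0.0480, -0.3238],  # row for id 0
    [-1.0883, -1.3166, -0.7612, -0.1886],  # row for id 1
    [-0.3456,  1.0513,  2.5954,  0.0092],  # row for id 2
    [ 0.9587,  0.5971, -0.1690,  1.7883],  # row for id 3
    [-1.9284,  1.5788,  0.5987, -2.2758],  # row for id 4
]
int_input = [[0, 3, 1], [2, 0, 4]]  # shape [B=2, T=3]

def embed(ids):
    """Look up one table row per id: [B, T] -> [B, T, C]."""
    return [[embedding_table[i] for i in row] for row in ids]

def one_hot_matmul(i):
    """The same lookup written as one_hot(i) @ table."""
    n_rows, n_cols = len(embedding_table), len(embedding_table[0])
    one_hot = [1.0 if r == i else 0.0 for r in range(n_rows)]
    return [sum(one_hot[r] * embedding_table[r][c] for r in range(n_rows))
            for c in range(n_cols)]

tok_embed = embed(int_input)
print(len(tok_embed), len(tok_embed[0]), len(tok_embed[0][0]))  # 2 3 4
print(tok_embed[0][1] == embedding_table[3])                    # id 3 -> row 3
```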

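In practice embedding_size is set much larger. For example, in the character-level GPT model used here, each character is mapped to a 384-dimensional vector, so the parameter of token_embedding_table is a 65 x 384 matrix.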

vocab_size, n_embd = 65, 384
token_embedding_table = nn.Embedding(vocab_size, n_embd)
tok_emb = token_embedding_table(int_input)

Also note that the parameter of nn.Embedding, i.e. the embedding table, is trainable and is updated during training.

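About Position Encoding

Reference: 【Transformer的位置编码（Position Encoding）进展梳理】 (a survey of progress in Transformer position encoding)

Positional Encoding has been used ever since the paper that introduced the Transformer, Attention Is All You Need. It encodes the position of each token within one input sequence.
For example, for an input of shape [Batch_size=64, Time_step=256], we need to encode the 65 kinds of characters (token_embedding_table) and also the positions 0, 1, …, 255 of the 256 characters within each sequence of the batch (position_embedding_table); the two encodings are then simply added to give the encoded input. The implementation is straightforward: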

B, T = int_input.shape

vocab_size, time_step, n_embd = 65, 256, 384
token_embedding_table = nn.Embedding(vocab_size, n_embd)
position_embedding_table = nn.Embedding(time_step, n_embd) # trainable

tok_emb = token_embedding_table(int_input) # (B,T,C)
pos_emb = position_embedding_table(torch.arange(T)) # (T,C)
x = tok_emb + pos_emb # (B,T,C)

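The mainstream encoding methods include

Absolute position encoding
Relative position encoding
Rotary position encoding

Attention Is All You Need uses absolute position encoding, with a sinusoidal (trigonometric) embedding scheme: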

$\begin{cases} P_{(k,2i)} &= \sin(\frac{k}{10000^{2i/d}})\\ P_{(k,2i+1)} &= \cos(\frac{k}{10000^{2i/d}})\\ \end{cases}$
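
The sinusoidal scheme can be sketched in a few lines of pure Python. This is a minimal illustration, assuming even dimensions 2i take the sine and odd dimensions 2i+1 take the cosine of the same angle; the function name `sinusoidal_position_encoding` is hypothetical.

```python
import math

def sinusoidal_position_encoding(time_step, d):
    """Return a [time_step, d] table with
    P[k][2i]   = sin(k / 10000^(2i/d))
    P[k][2i+1] = cos(k / 10000^(2i/d))."""
    table = []
    for k in range(time_step):                 # k: position index
        row = [0.0] * d
        for two_i in range(0, d, 2):           # two_i plays the role of 2i
            angle = k / (10000 ** (two_i / d))
            row[two_i] = math.sin(angle)
            if two_i + 1 < d:
                row[two_i + 1] = math.cos(angle)
        table.append(row)
    return table

P = sinusoidal_position_encoding(time_step=4, d=6)
print([round(v, 4) for v in P[0]])  # position 0: sin(0) = 0 and cos(0) = 1 alternate
print([round(v, 4) for v in P[1]])
```

Unlike a trainable position_embedding_table, this table has no parameters: it is fully determined by the position k and the dimension d.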

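The BERT paper also uses absolute position encoding, but with a trainable embedding (not given by a formula).

Absolute position encoding

Relative position encoding

Rotary position encoding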

About the Transformer Block

The core module of GPT is the Transformer Block.

And the core module inside the Transformer Block is MultiHeadSelfAttention, whose computation can be simplified into the following flow chart

graph LR
Input["Input: [B,T,C]"]
Input --"Wq: [C, dim_q]"--> Q
Input --"Wk: [C, dim_k]"--> K
Input --"Wv: [C, dim_v]"--> V;
Q["Q: [B,T,dim_q]"] --> QK
K["K: [B,T,dim_k]"] --> QK
QK["Q·Kᵀ: [B,T,T]
(dim_q=dim_k)"] --> softmax
softmax["Softmax(Q·Kᵀ/√dim_k)
(B,T,T)"] --> Out
V["V: [B,T,dim_v]"] ----> Out
Out["Softmax(·)·V: [B,T,dim_v]"]

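For an input input: [B, T, C] whose Time_step is not fixed, it is first multiplied by three parameter matrices Wq: [C, dim_q], Wk: [C, dim_k], Wv: [C, dim_v], which linearly map it to three tensors Q: [B, T, dim_q], K: [B, T, dim_k], V: [B, T, dim_v]. These three matrices are the main parameters of Attention, and by extension of the Transformer and GPT.

Notes on dim_q, dim_k, dim_v:

To compute $Q\cdot K^T$ , we need dim_q = dim_k;
To chain two Transformer Blocks, so that the output [B, T, dim_v] of one Block can be fed directly into the next Block as [B, T, C], we usually set dim_v = C; otherwise an extra mapping from dim_v to C is needed.

From the Attention computation above we can see that the key to the Transformer handling variable-length input is that, when computing Q, K, V, the matrix multiplication places no requirement on B and T in the input shape [B, T, C]. The input is multiplied by Wq: [C, dim_q], Wk: [C, dim_k], Wv: [C, dim_v], and this only requires C in the input shape to be fixed.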
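
A minimal pure-Python sketch of this point: one fixed projection matrix is applied to inputs of three different lengths T. The small sizes, the random values, and the helper `matmul` are assumptions for illustration only.

```python
import random

def matmul(a, b):
    """Multiply an [n, m] matrix by an [m, p] matrix (lists of lists)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

random.seed(0)
C, dim_q = 4, 3

# One fixed projection Wq: [C, dim_q]. Its shape never mentions T.
Wq = [[random.uniform(-1, 1) for _ in range(dim_q)] for _ in range(C)]

shapes = []
for T in (2, 5, 9):                       # three different sequence lengths
    x = [[random.uniform(-1, 1) for _ in range(C)] for _ in range(T)]  # [T, C]
    Q = matmul(x, Wq)                     # [T, C] @ [C, dim_q] -> [T, dim_q]
    shapes.append((len(Q), len(Q[0])))

print(shapes)  # [(2, 3), (5, 3), (9, 3)]
```

The same argument applies to Wk and Wv, and to the batch dimension B, since each sequence in the batch is projected independently.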

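About MultiHead

There are two ways to compute MultiHead. The first maps $Q, K, V$ only once, then splits them along the last dimension, i.e. the "C" dimension of [B, T, C], into h pieces $Q_1\cdots Q_h$ , $K_1\cdots K_h$ , $V_1\cdots V_h$ , computes the Attention of each $Q_i, K_i, V_i$ separately, and finally concatenates the results to get the MultiHead output.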

$\begin{split} Q &= \text{concat}[Q_1\cdots Q_h]\\ K &= \text{concat}[K_1\cdots K_h]\\ V &= \text{concat}[V_1\cdots V_h] \end{split}$

The original $Q,K,V$ have shape [B, T, dim]; after splitting, each $Q_i, K_i, V_i$ has shape [B, T, dim/h]. Below is a comparison of ordinary SelfAttention and MultiHeadSelfAttention:

Normal SelfAttention

$\text{Attention}(Q,K,V) = \text{Attention}(\text{concat}[Q_1\cdots Q_h],\text{concat}[K_1\cdots K_h],\text{concat}[V_1\cdots V_h])$
MultiHead SelfAttention

$\text{MultiHead}(Q,K,V) = \text{concat}[\text{Attention}(Q_1,K_1,V_1),\cdots, \text{Attention}(Q_h,K_h,V_h)]$

import torch
from torch import nn

'''
x.shape: [B, T, C]
Q = x @ Wq; K = x @ Wk; V = x @ Wv
Q.shape: [B, T, dim_q]; K.shape: [B, T, dim_k]; V.shape: [B, T, dim_v]
Q = [Q1,..,Qh], K = [K1,...,Kh], V = [V1,...,Vh]
Q1-Qh: [B, T, dim_q/h]; K1-Kh: [B, T, dim_k/h]; V1-Vh: [B, T, dim_v/h]
concat[attention(Q1,K1,V1), ..., attention(Qh,Kh,Vh)]
attention(Q1,K1,V1)-attention(Qh,Kh,Vh): [B, T, dim_v/h]
'''
class MultiHeadSelfAttention(nn.Module):
  def __init__(self, dim_in, dim_k, dim_v, num_heads=8):
    super(MultiHeadSelfAttention, self).__init__()
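    # dim_q is set equal to dim_k below so that Q @ K^T is well defined
    assert dim_k % num_heads == 0 and dim_v % num_heads == 0, "dim_k and dim_v must be divisible by num_heads"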
    self.dim_in = dim_in
    self.dim_q = dim_k
    self.dim_k = dim_k
    self.dim_v = dim_v
    self.num_heads = num_heads
    self.linear_q = nn.Linear(dim_in, dim_k, bias=False) # Wq
    self.linear_k = nn.Linear(dim_in, dim_k, bias=False) # Wk
    self.linear_v = nn.Linear(dim_in, dim_v, bias=False) # Wv
    self.norm_fact = (dim_k // num_heads)**(1/2)

  def forward(self, x):
    '''x.shape: [B, T, dim_in]'''
    B, T, dim_in = x.shape
    assert dim_in == self.dim_in
    dim_q = self.dim_q
    dim_k = self.dim_k
    dim_v = self.dim_v
    heads = self.num_heads

    q = self.linear_q(x).reshape(B, T, heads, dim_q//heads).transpose(1,2) # (B, heads, T, dim_q//heads)
    k = self.linear_k(x).reshape(B, T, heads, dim_k//heads).transpose(1,2) # (B, heads, T, dim_k//heads)
    v = self.linear_v(x).reshape(B, T, heads, dim_v//heads).transpose(1,2) # (B, heads, T, dim_v//heads)

    # Q:[B, h, T, dim_q//h] @ K^T: [B, h, dim_k//h, T] -> Q@K^T: [B, h, T, T]
    attention = torch.matmul(q, k.transpose(-2,-1)) / self.norm_fact
    attention = nn.Softmax(dim=-1)(attention)
    # Q@K^T: [B, h, T, T] @ V: [B, h, T, dim_v//h] -> (Q@K^T)@V: [B, h, T, dim_v//h]
    attention = torch.matmul(attention, v)
    # concat: [B, h, T, dim_v//h] -> [B, T, h, dim_v//h] -> (B, T, dim_v)
    attention = attention.transpose(1,2).reshape(B, T, dim_v)
    return attention
    
if __name__ == "__main__":
  # Batch_size, Time_step, Embedding_size
  B, T, C = 64, 256, 384
  input = torch.rand(B, T, C)
  # input:(B, T, C) -> Q:(B, T, dim_q), K:(B, T, dim_k), V:(B, T, dim_v)
  # Q = [Q1,..,Qh], K = [K1,...,Kh], V = [V1,...,Vh]
  # Q1-Qh: [B, T, dim_q/h]; K1-Kh: [B, T, dim_k/h]; V1-Vh: [B, T, dim_v/h]
  # Normally, there will be dim_q = dim_k = dim_v = C
  multihead = MultiHeadSelfAttention(dim_in=C, dim_k=C, dim_v=C, num_heads=8)
  output = multihead.forward(input)
  print(output.shape)

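The second way is to build multiple Attention heads that each process the input and produce an output, and then concatenate the outputs of all heads. Each head has parameter matrices of size [C, dim_q|k|v // num_head] and an output of shape [B, T, dim_k // num_head] (the implementation below takes dim_q = dim_k = dim_v); concatenating the results of all heads gives an output of shape [B, T, dim_k].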

$\text{MultiHead}(x[B, T, C]) = \text{concat}[\underset{[B, T, dim\_k // num\_head]}{\text{Attention}(x[B, T, C])}, \cdots, \underset{[B, T, dim\_k // num\_head]}{\text{Attention}(x[B, T, C])}]$

import torch
from torch import nn

class Head(nn.Module):
  """ one head of self-attention """
  def __init__(self, dim_in, head_size):
    super().__init__()
    # dim_q = dim_k = dim_v = head_size = dim_in // num_heads
    self.query = nn.Linear(dim_in, head_size, bias=False) # Wq
    self.key = nn.Linear(dim_in, head_size, bias=False)   # Wk
    self.value = nn.Linear(dim_in, head_size, bias=False) # Wv
    self.norm_fact = head_size**0.5

  def forward(self, x):
    # input of size (batch, time-step, channels)
    # output of size (batch, time-step, head size)
    B,T,C = x.shape
    
    q = self.query(x) # (B,T,hs)
    k = self.key(x)   # (B,T,hs)
    v = self.value(x) # (B,T,hs)

    # compute attention scores ("affinities")
    qk = q @ k.transpose(-2,-1) / self.norm_fact # (B, T, hs) @ (B, hs, T) -> (B, T, T)
    sftmax_qk = nn.Softmax(dim=-1)(qk) # (B, T, T)
    attention = sftmax_qk @ v # (B, T, T) @ (B, T, hs) -> (B, T, hs)
    return attention

class MultiHeadSelfAttention(nn.Module):
  """ multiple heads of self-attention in parallel """

  def __init__(self, dim_in, num_heads):
    super().__init__()
    self.heads = nn.ModuleList([Head(dim_in=dim_in, head_size=dim_in//num_heads) for _ in range(num_heads)])

  def forward(self, x):
    out = torch.cat([h(x) for h in self.heads], dim=-1)
    return out

if __name__ == "__main__":
  # Batch_size, Time_step, Embedding_size
  B, T, C = 64, 256, 384
  input = torch.rand(B, T, C)
  #                 -> Q1,-,Qh: (B, T, C//num_heads)
  # input:(B, T, C) -> K1,-,Kh: (B, T, C//num_heads) 
  #                 -> V1,-,Vh: (B, T, C//num_heads)
  # multihead = concat[Attention(Q1,K1,V1), ..., Attention(Qh,Kh,Vh)]
  multihead = MultiHeadSelfAttention(dim_in=C, num_heads=8)
  output = multihead.forward(input)
  print(output.shape)

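The two differ only in the form of the computation; they express the same thing. In actual programming the first is preferred, since reshape + transpose simplifies the code. For conceptual understanding the second is preferable, because it expresses the idea of Bagging more explicitly.

The two also have the same number of parameters. The first implementation, with a single Attention, has 3 parameter matrices of size Wq: [C, dim_q], Wk: [C, dim_k], Wv: [C, dim_v]; the second, with multiple Attention heads, has num_head × 3 parameter matrices, where each group has size Wq: [C, dim_q // num_head], Wk: [C, dim_k // num_head], Wv: [C, dim_v // num_head].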
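
The parameter counts can be checked with a few lines of arithmetic. This sketch assumes the sizes used in the examples above (C = dim_q = dim_k = dim_v = 384, num_heads = 8), no bias terms, and dimensions divisible by num_heads.

```python
C = 384
dim_q = dim_k = dim_v = 384
num_heads = 8

# Implementation 1: three full matrices Wq, Wk, Wv.
params_single = C * dim_q + C * dim_k + C * dim_v

# Implementation 2: num_heads groups of three smaller matrices.
params_per_head = (C * (dim_q // num_heads)
                   + C * (dim_k // num_heads)
                   + C * (dim_v // num_heads))
params_multi = num_heads * params_per_head

print(params_single, params_multi)  # 442368 442368
```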

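About the Shapes of Q, K, V

For the attention mechanism itself, thinking about what Query | Key | Value actually mean, we should find that

T_key == T_value, because keys and values come in pairs;
T_query ?= T_key: the number of queries need not equal the number of key-value pairs;
dim_q == dim_k, because a query is matched against keys: we compute the similarity between the query and each key, and output values according to some rule (max, mean);
dim_v ?= dim_k: the dimension of a value need not equal that of a key; it can be arbitrary.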
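
The four constraints above can be checked on a small pure-Python attention function. This is a minimal sketch for a single unbatched head; the sizes (T_query = 2, T_key = 3, d_k = 4, d_v = 5) and the input values are arbitrary choices for illustration.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Q: [T_q, d_k], K: [T_k, d_k], V: [T_k, d_v] -> output: [T_q, d_v]."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # similarity of this query with every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)                       # one weight per key/value pair
        out.append([sum(w * v[c] for w, v in zip(weights, V))
                    for c in range(len(V[0]))])
    return out

T_query, T_key, d_k, d_v = 2, 3, 4, 5                   # T_query != T_key, d_v != d_k
Q = [[0.1 * (i + j) for j in range(d_k)] for i in range(T_query)]
K = [[0.2 * (i - j) for j in range(d_k)] for i in range(T_key)]
V = [[float(i * d_v + j) for j in range(d_v)] for i in range(T_key)]

out = attention(Q, K, V)
print(len(out), len(out[0]))  # 2 5
```

In self-attention Q, K, V all come from the same input, so T_query = T_key automatically; the more general case shown here corresponds to attending over a different sequence.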

To compute $Q\cdot K^T$ , we need dim_q == dim_k;
To chain two Transformer Blocks, so that the output [B, T, dim_v] of one Block can be fed directly into the next Block as [B, T, C], we usually set dim_v == C; otherwise an extra mapping from dim_v to C is needed.

About the Meaning of Q, K, V

Q, K, V stand for Query, Key, Value. Imagine the following scenario: we have a set of key-value pairs {key1: value1, ..., key_t: value_t} and a query $q_1$ (data of the same type as a key), and we need to find, among these pairs, the key most similar to $q_1$ and output its value. Suppose $\text{query, key} \in \mathbb{R}^k, \text{value} \in \mathbb{R}^v$ ; all the keys together form a t × k matrix $K$ , and

$q_1K^T = \begin{bmatrix} q_1k_1^T & q_1k_2^T & \cdots & q_1k_t^T \end{bmatrix}$

Taking the inner product of $q_1$ with $K$ , i.e. with each of $k_1, \cdots, k_t$ , gives the "similarity" between the query $q_1$ and every key.

If we output the value directly under a maximum-similarity rule, let $\operatorname{argmax}(q_1K^T) = i$ ; then:

$[0,\cdots, 0, \underset{i}{1}, 0, \cdots, 0] \begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_t \end{bmatrix} = v_{i}$

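This yields the value of the key that is most similar to the query.

If we instead output the values weighted by similarity, we have:

$\text{softmax}(q_1K^T)V = \left[\frac{\exp(q_1k_1^T)}{\sum_{j=1}^{t}\exp(q_1k_j^T)}, \cdots, \frac{\exp(q_1k_t^T)}{\sum_{j=1}^{t}\exp(q_1k_j^T)}\right] \begin{bmatrix} v_1\\ \vdots \\ v_t \end{bmatrix}$

This yields not the value of any single key but a linear combination of the values, i.e. an element of the space spanned by $\{v_1, \cdots, v_t\}$ . Since the softmax weights are nonnegative and sum to 1, it is in fact a convex combination.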
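
The two retrieval rules can be compared on a tiny example. This is a minimal sketch with made-up keys, values, and query (t = 3 keys in R^2, values in R^3), chosen only so that the numbers are easy to follow.

```python
import math

keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]                        # k_1, k_2, k_3
values = [[10.0, 0.0, 0.0], [0.0, 10.0, 0.0], [0.0, 0.0, 10.0]]    # v_1, v_2, v_3
q1 = [2.0, 0.5]

# q_1 K^T: inner product of the query with every key -> [2.0, 0.5, 2.5]
scores = [sum(a * b for a, b in zip(q1, k)) for k in keys]

# Hard lookup: output the value of the single most similar key.
i = max(range(len(scores)), key=lambda j: scores[j])
hard_out = values[i]

# Soft lookup: weight every value by softmax(q_1 K^T).
m = max(scores)
exps = [math.exp(s - m) for s in scores]
weights = [e / sum(exps) for e in exps]
soft_out = [sum(w * v[c] for w, v in zip(weights, values)) for c in range(3)]

print(i, hard_out)
print([round(w, 4) for w in weights])
print([round(x, 4) for x in soft_out])
```

The hard lookup returns exactly one stored value, while the soft lookup blends all of them, with the most similar key contributing the most. The soft version is also differentiable with respect to the scores, which is what makes it trainable by gradient descent.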

Note that $\sqrt{d_k}$ cannot be cancelled out when computing the softmax, because softmax is not invariant to scaling of its inputs:

$\begin{split} \text{softmax}(q_1 k_1^T/\sqrt{d_k}, \cdots, q_1 k_N^T/\sqrt{d_k}) \\ \neq \text{softmax}(q_1 k_1^T, \cdots, q_1 k_N^T) \end{split}$

$\text{softmax}(\frac{Q K^T}{\sqrt{d_k}}) = \begin{bmatrix} \text{softmax}(q_1 k_1^T/\sqrt{d_k}, \cdots, q_1 k_N^T/\sqrt{d_k})\\ \text{softmax}(q_2 k_1^T/\sqrt{d_k}, \cdots, q_2 k_N^T/\sqrt{d_k})\\ \vdots\\ \text{softmax}(q_N k_1^T/\sqrt{d_k}, \cdots, q_N k_N^T/\sqrt{d_k}) \end{bmatrix}$
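
A small numeric sketch of why the scaling matters: dividing the scores by $\sqrt{d_k}$ changes the softmax output and makes it less peaked. The raw scores and d_k = 64 below are hypothetical values picked for illustration.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

d_k = 64
scores = [8.0, 0.0, -8.0]                                  # hypothetical raw q·k values

unscaled = softmax(scores)
scaled = softmax([s / math.sqrt(d_k) for s in scores])     # divide by sqrt(64) = 8

print([round(p, 4) for p in unscaled])   # almost all mass on the first key
print([round(p, 4) for p in scaled])     # mass spread over all three keys
```

Without the scaling, large dot products push the softmax toward a near one-hot distribution, where gradients become very small; this is the motivation given for the $1/\sqrt{d_k}$ factor in Attention Is All You Need.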

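About Setting the Upper Triangle to -inf

In tasks related to time series or text generation, when computing Attention we usually set the upper-triangular part of $QK^T$ (the entries above the diagonal) to $-\infty$ , so that after the softmax the upper-triangular part becomes 0.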

import torch
from torch import nn

B, T, C = 2, 4, 6
Q = torch.rand(B, T, C)
K = torch.rand(B, T, C)
V = torch.rand(B, T, C)
QK = Q @ K.transpose(-2, -1) / (K.shape[-1]**0.5)  # scale by sqrt(d_k); d_k is the last dimension, not the batch size
tril = torch.tril(torch.ones(T, T))
'''
print(tril)
tensor([[1., 0., 0., 0.],
        [1., 1., 0., 0.],
        [1., 1., 1., 0.],
        [1., 1., 1., 1.]])
'''
QK = QK.masked_fill(tril[:T, :T] == 0, float('-inf')) # <-- -inf
print(QK[0])
'''
tensor([[0.7288,   -inf,   -inf,   -inf],
        [0.7355, 1.7677,   -inf,   -inf],
        [0.8977, 1.2803, 1.0003,   -inf],
        [0.8913, 0.7871, 0.8950, 0.6269]])
'''
sftmax_QK = nn.Softmax(dim=-1)(QK)
print(sftmax_QK[0])
'''
tensor([[1.0000, 0.0000, 0.0000, 0.0000],
        [0.2627, 0.7373, 0.0000, 0.0000],
        [0.2798, 0.4102, 0.3100, 0.0000],
        [0.2723, 0.2454, 0.2733, 0.2090]])
'''
attention = sftmax_QK @ V

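Note that we must set -inf first and then take the softmax, rather than take the softmax first and then set 0, because in the latter case the rows are no longer normalized (they no longer sum to 1).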
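
The difference between the two orders can be seen on a 3 × 3 example. This is a minimal pure-Python sketch with made-up scores; it only compares the row sums of the two results.

```python
import math

def softmax(row):
    """Numerically stable softmax; exp(-inf) evaluates to 0.0."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

T = 3
scores = [[0.5, 1.0, 0.2],
          [0.3, 0.8, 0.1],
          [0.9, 0.4, 0.6]]

# Correct order: fill positions j > i with -inf, then softmax each row.
masked = [[scores[i][j] if j <= i else float("-inf") for j in range(T)]
          for i in range(T)]
mask_then_softmax = [softmax(row) for row in masked]

# Wrong order: softmax each full row, then zero out positions j > i.
softmax_then_zero = [[p if j <= i else 0.0
                      for j, p in enumerate(softmax(scores[i]))]
                     for i in range(T)]

print([round(sum(row), 4) for row in mask_then_softmax])   # every row sums to 1
print([round(sum(row), 4) for row in softmax_then_zero])   # early rows sum to less than 1
```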

Why set the upper triangle of $\text{softmax}(\frac{QK^T}{\sqrt{d_k}})$ to 0?

Because $V = \{v_1, \cdots, v_t\}$ corresponds one-to-one to the t input tokens, and $v_i$ is a linear transformation of the input $x_i$ :

$\begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_t \end{bmatrix} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_t \end{bmatrix} W_v = \begin{bmatrix} x_1 W_v \\ x_2 W_v \\ \vdots \\ x_t W_v \end{bmatrix}$

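When making predictions, we use the first n tokens to predict the (n+1)-th token. For example, we use the information in $\{x_1\}$ to predict $x_2$ ; the information in $\{x_1, x_2\}$ to predict $x_3$ ; …; and the information in $\{x_1,x_2, \cdots, x_t\}$ to predict $x_{t+1}$ . Therefore, when predicting $x_n$ we must not give the model any information about $x_n$ or anything after it; this information has to be "deleted". Setting the upper triangle of $\text{softmax}(\frac{QK^T}{\sqrt{d_k}})$ to 0 is exactly this operation of "deleting" future information.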

$\text{softmax}(\frac{QK^T}{\sqrt{d_k}})\cdot V = \begin{bmatrix} \alpha_{11} & 0 & \cdots & 0\\ \alpha_{21} & \alpha_{22} & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ \alpha_{t1} & \alpha_{t2} & \cdots & \alpha_{tt}\\ \end{bmatrix} \cdot \begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_t \end{bmatrix} = \begin{bmatrix} p_{x_2} \\ p_{x_3} \\ \vdots \\ p_{x_{t+1}} \end{bmatrix}$

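From this formula we can see that $p_{x_2}$ (the prediction of $x_2$ ) uses only $v_1$ (the linear transformation of $x_1$ ) and none of $v_2,\cdots, v_t$ after it; $p_{x_3}$ (the prediction of $x_3$ ) uses only $v_1, v_2$ and nothing after them; and the final $p_{x_{t+1}}$ uses the information of the whole sentence.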
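
This "no information from the future" property can be tested directly: changing the last token must leave all earlier outputs unchanged. The sketch below is a simplified single-head causal attention in pure Python. It assumes Wq = Wk = Wv = identity for brevity (so Q = K = V = x), and it restricts each row to keys 0..i, which is equivalent to filling the upper triangle with -inf before the softmax.

```python
import math

def causal_attention(Q, K, V):
    """Single-head causal attention on [T, d] lists: row i only sees rows 0..i."""
    d_k = len(K[0])
    out = []
    for i, q in enumerate(Q):
        # scores against keys 0..i only; keys j > i are never touched
        scores = [sum(a * b for a, b in zip(q, K[j])) / math.sqrt(d_k)
                  for j in range(i + 1)]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append([sum(w * V[j][c] for j, w in enumerate(weights))
                    for c in range(len(V[0]))])
    return out

x = [[0.1, 0.2], [0.4, -0.3], [-0.5, 0.6], [0.7, 0.8]]   # T = 4 tokens, C = 2
x_changed = [row[:] for row in x]
x_changed[3] = [9.0, -9.0]                                # alter only the last token

out_a = causal_attention(x, x, x)
out_b = causal_attention(x_changed, x_changed, x_changed)

print(out_a[:3] == out_b[:3])   # earlier positions are unaffected
print(out_a[3] == out_b[3])     # only the last position changes
```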

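About Token Generation During Inference

At inference time GPT has two methods: the ordinary forward, and generate.

The forward method takes a sequence of length L and outputs a sequence of length L. For example, if the input is [token(1), token(2), …, token(L)], the output is [pred(2), pred(3), …, pred(L+1)]. One forward call predicts L tokens, where pred(i) is the prediction made from [token(1), …, token(i-1)], but only the last one, pred(L+1), is what we need.

Note that token(i) denotes an actual input token and pred(i) denotes a predicted token.

The generate method takes a sequence of length L and repeatedly calls forward to predict the N tokens that follow it (N is a hyperparameter). For example, with input [token(1), token(2), …, token(L)], one forward call gives [pred(2), pred(3), …, pred(L+1)]. We take the last prediction pred(L+1) and concatenate it with the length-L input [token(1), token(2), …, token(L)] to get a sequence of length L+1, [token(1), token(2), …, token(L), pred(L+1)]. This length-(L+1) sequence is then used as input for another forward call, which gives the prediction pred(L+2) of the (L+2)-th token; we concatenate again to get a sequence of length L+2, and so on. Repeating this yields an output of length L+N.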
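
The generate loop can be sketched independently of any real model. In the minimal example below, `toy_forward` is a hypothetical stand-in for a GPT forward pass: it is a deterministic hash, not a language model, and it only mimics the interface (L tokens in, L predictions out, each prediction depending only on earlier tokens). A real model would instead output a probability distribution per position and sample from it.

```python
def toy_forward(tokens, vocab_size=65):
    """Hypothetical stand-in for forward: for an input of length L, return
    L predictions, where prediction i depends only on tokens[0..i]."""
    preds = []
    running = 0
    for tok in tokens:
        running = (running * 31 + tok + 1) % vocab_size
        preds.append(running)
    return preds

def generate(tokens, max_new_tokens):
    """Call forward repeatedly, each time appending only the last prediction."""
    tokens = list(tokens)                  # copy so the prompt is not modified
    for _ in range(max_new_tokens):
        preds = toy_forward(tokens)        # one prediction per input position
        next_token = preds[-1]             # only the last prediction is needed
        tokens.append(next_token)          # concatenate, sequence grows by 1
    return tokens

prompt = [32, 53, 1, 40, 43]               # "To be" under the encoding above
out = generate(prompt, max_new_tokens=4)
print(len(prompt), len(out))               # 5 9
print(out[:len(prompt)] == prompt)         # the prompt is kept as a prefix
```

Each iteration recomputes all L predictions and discards all but the last, which is why the loop costs one full forward pass per generated token.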

How to Compute the Loss

About $\frac{1}{\sqrt{d_k}}$

Encoder-Decoder vs. Decoder-only Architectures

How to Load Large Datasets | Memory Mapping

numpy.memmap