PyTorch Installation

  1. Install Miniconda

  2. Install CUDA (this machine has no discrete GPU for now, so this step is skipped)

  3. Create a new Python virtual environment with conda

    ```bash
    conda create --name pytorch python=3.9
    ```

    Activate the new environment:

    ```bash
    conda activate pytorch
    ```

    Note: pytorch here is only the name of the environment; you can change it to any name you like.

  4. Go to the PyTorch website and choose the options that match your setup. If the machine has no discrete GPU or CUDA is not installed, choose the CPU build.

    After making your choices you get an install command; run it inside the conda environment created above.

    install pytorch

    Note 1: The suggested command ends with the -c pytorch argument, which downloads from the official channel and can be slow in mainland China. If you have already configured conda mirrors (e.g. the Tsinghua or Aliyun source), simply remove -c pytorch; otherwise run the command behind a VPN.

    The first attempt with conda failed, with or without -c pytorch.

    The second attempt with pip crawled at 30 KB/s and was aborted with Ctrl+C; the third attempt, pip behind a VPN, ran normally at 2 MB/s.

    Note 2: The CUDA version options are for machines with a CUDA-capable discrete GPU; if yours is unsupported or absent, choose CPU.

    Note 3: If the conda install fails, try installing with pip instead.

  5. After installation completes, output like the following is printed

    ```text
    Requirement already satisfied: certifi>=2017.4.17 in c:\users\xiaophai\.conda\envs\pytorch\lib\site-packages (from requests->torchvision) (2022.12.7)
    Installing collected packages: urllib3, typing-extensions, pillow, numpy, idna, charset-normalizer, torch, requests, torchvision, torchaudio
    Successfully installed charset-normalizer-2.1.1 idna-3.4 numpy-1.24.1 pillow-9.4.0 requests-2.28.1 torch-1.13.1 torchaudio-0.13.1 torchvision-0.14.1 typing-extensions-4.4.0 urllib3-1.26.13
    ```

  6. Test that PyTorch runs correctly

    ```python
    (pytorch) C:\Users\xiaophai>python
    Python 3.9.15 (main, Nov 24 2022, 14:39:17) [MSC v.1916 64 bit (AMD64)] on win32
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import torch
    >>>
    >>> print(torch.__version__)
    1.13.1+cpu
    >>> print(torch.cuda.is_available())
    False
    ```

PyTorch

Official documentation: PyTorch official docs

Chinese documentation: PyTorch Chinese docs

Tensor

Creating Tensors

PyTorch tensors are similar to NumPy ndarrays, and almost all data in PyTorch (vectors, matrices) is stored as tensors.

  • Create an empty / all-zero matrix

torch.empty returns an uninitialized tensor (its values are whatever happens to be in memory), while torch.zeros returns an all-zero tensor.

```python
torch.empty(3,4)
# or
torch.zeros(3,4)
```
```python
# output of torch.zeros(3,4)
tensor([[0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.]])
```

  • Create an all-ones matrix

```python
torch.ones(3,4)
```
```python
# output
tensor([[1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.]])
```

  • Create a matrix with given values

```python
torch.Tensor([[1,2,3,4],
              [4,5,6,7],
              [7,8,9,0]])
```

  • Create a random matrix

torch.rand returns random numbers drawn uniformly from [0, 1)

```python
torch.rand(3,4)
```
```python
# output
tensor([[0.1591, 0.5634, 0.0369, 0.6377],
        [0.2301, 0.8195, 0.9913, 0.6825],
        [0.8075, 0.8222, 0.6498, 0.5535]])
```

  • Create random integers

torch.randint

```python
torch.randint(low, high, size)  # samples from [low, high): low included, high excluded
```
```python
# generate 16 integers in [0, 10)
>>> torch.randint(0, 10, (16,))
tensor([9, 2, 7, 9, 0, 7, 1, 9, 5, 1, 8, 3, 5, 7, 3, 0])
```

  • Create a sequence

```python
torch.arange(2,5)
```
```python
# output
tensor([2, 3, 4])
```

  • Reshape a sequence into a matrix

```python
torch.arange(9).reshape(3,3)
```
```python
tensor([[0, 1, 2],
        [3, 4, 5],
        [6, 7, 8]])
```

  • NumPy ➡ PyTorch

```python
import numpy
array = numpy.ones((2, 3))  # numpy.ones takes a shape tuple
tensor = torch.from_numpy(array)
```
```python
# array
[[1. 1. 1.]
 [1. 1. 1.]]
# tensor
tensor([[1., 1., 1.],
        [1., 1., 1.]], dtype=torch.float64)
```

  • Size information

```python
x = torch.zeros(3,4)
print(x.size())
```
```python
# output
torch.Size([3, 4])
```

tensor.item()

torch.tensor.item | Torch Docs

item() is a method of PyTorch tensors that returns the value of the tensor as a plain Python number; it only works on tensors with a single element.

  • PyTorch demo

```python
a = torch.tensor([1,2,3])

print(a[0], type(a[0]))
# output: tensor(1) <class 'torch.Tensor'>
print(a[0].item(), type(a[0].item()))
# output: 1 <class 'int'>
print(a.item())
# raises ValueError: only one element tensors can be converted to Python scalars
```

Tensor Indexing

```python
x = torch.Tensor([[1, 2, 3, 4],
                  [5, 6, 7, 8],
                  [9,10,11,12]])
```

  • Index the first element

Tensor indexing works like NumPy indexing; subscripts start from 0.

```python
x[0,0]
# or
x[0][0]
```
```python
# output
tensor(1.)
```

  • Index the second row

```python
x[1]
# or
x[1,]
# or
x[1,:]
```
```python
# output
tensor([5., 6., 7., 8.])
```

  • Index the second column

```python
x[:,1]
```
```python
# output
tensor([ 2.,  6., 10.])
```

Note that x[,1] is a syntax error.

```python
x[:,[1]]
```
```python
# output
tensor([[ 2.],
        [ 6.],
        [10.]])
```

  • Index elements 1 to 3 of row 3

0:3 works like Python slicing: it denotes [0, 3), i.e. the left end is included and the right end excluded.

```python
x[2,0:3]
# or
x[2][0:3]
```
```python
tensor([ 9., 10., 11.])
```

  • Index rows 2-3, columns 1-3

```python
x[1:3,0:3]
```
```python
# output: rows 1:3, columns 0:3
tensor([[ 5.,  6.,  7.],
        [ 9., 10., 11.]])
```

  • Index the last two rows

Tensor indexing also supports negative indices, which count from the end, just like Python.

```python
x[-2:]
# -2 means the second-to-last position
# [-2:] means from the second-to-last position to the end, i.e. the last two rows
```
```python
# output
tensor([[ 5.,  6.,  7.,  8.],
        [ 9., 10., 11., 12.]])
```

  • Custom (fancy) indexing

A list or tuple of indices selects arbitrary positions.

```python
x[:,[2,1,0]]
```
```python
# output: columns 3, 2, 1
tensor([[ 3.,  2.,  1.],
        [ 7.,  6.,  5.],
        [11., 10.,  9.]])
```

```python
x[[0,1,2],[2,1,0]]
# or
x[(0,1,2),(2,1,0)]
```
```python
# output: x[0,2], x[1,1], x[2,0]
tensor([3., 6., 9.])
```

Tensor Operations

  • Addition and subtraction

Addition and subtraction follow matrix arithmetic: tensors of the same size can be added or subtracted, and a tensor can be added to or subtracted from a scalar. Tensors with incompatible (non-broadcastable) sizes cannot be added or subtracted; doing so raises

RuntimeError: The size of tensor a must match the size of tensor b at non-singleton dimension 1

```python
# four ways to add
result = x+y
result = x.add(y)
result = torch.add(x,y)
x.add_(y)  # in place: x = x + y
```
```python
# four ways to subtract
result = x-y
result = x.sub(y)
result = torch.sub(x,y)
x.sub_(y)  # in place: x = x - y
```

  • Multiplication and division

Multiplication here is element-wise between tensors of the same size, not matrix multiplication. The size requirements are the same as for addition and subtraction: the operands must have the same size, or one of them must be a scalar.

```python
# four ways to multiply
result = x*y
result = x.multiply(y)
result = torch.multiply(x,y)
x.multiply_(y)  # in place: x = x * y
```
```python
# four ways to divide
result = x/y
result = x.div(y)
result = torch.div(x,y)
x.div_(y)  # in place: x = x / y
```

  • Reciprocal

```python
result = 1/x
result = torch.reciprocal(x)
result = x.reciprocal()
```

  • Power

The two tensors in a power operation must have the same size; the result is the element-wise power.

```python
result = x**y
result = x.pow(y)
result = torch.pow(x,y)
x.pow_(y)  # in place: x = x ** y
```

Broadcasting

When two tensors of different shapes are combined in one operation, the smaller tensor is expanded to match the larger one. This expansion is the broadcasting mechanism: the lower-dimensional data is (conceptually) copied along the missing dimensions.

Two tensors are broadcastable if:

  • each tensor has at least one dimension
  • the dimension sizes are equal, or
  • the sizes differ but one of them is 1, or
  • the sizes differ because one of the dimensions does not exist

Computation rules:

  • if the numbers of dimensions differ, dimensions are prepended to the smaller tensor
  • along each dimension, the result takes the larger size
  • expanding a dimension copies the values

```python
# broadcasting
x = torch.Tensor([[1],
                  [2],
                  [3]])
y = torch.Tensor([[1, 2]])
x + y
```
```python
# output
tensor([[2., 3.],
        [3., 4.],
        [4., 5.]])
```

PyTorch Linear Algebra

  • Matrix transpose

```python
A = torch.tensor([[1,2,3],[4,5,6]])
```
```python
# A
tensor([[1, 2, 3],
        [4, 5, 6]])
# A.T
tensor([[1, 4],
        [2, 5],
        [3, 6]])
```

  • Matrix multiplication

The two tensors must satisfy the usual linear-algebra shape rule for matrix multiplication.

```python
A = torch.arange(6).reshape(2,3)
B = torch.arange(6).reshape(3,2)
AB = torch.mm(A,B)
```
```python
# A
tensor([[0, 1, 2],
        [3, 4, 5]])
# B
tensor([[0, 1],
        [2, 3],
        [4, 5]])
# AB
tensor([[10, 13],
        [28, 40]])
```

  • Matrix-vector multiplication

Both arguments of torch.mm() must be matrices, i.e. have two dimensions (even if one of them has size 1). For multiplying a matrix by a vector, use torch.mv() instead.

```python
A = torch.arange(6).reshape(2,3)
x = torch.arange(3)
Ax = torch.mv(A,x)
```
```python
# A
tensor([[0, 1, 2],
        [3, 4, 5]]) torch.Size([2, 3])
# x
tensor([0, 1, 2]) torch.Size([3])
# Ax
tensor([ 5, 14]) torch.Size([2])
```

Norms

  • Euclidean norm

```python
x = torch.tensor([3.,4])  # dtype is torch.FloatTensor
Norm = torch.norm(x)
```
```python
# x and its dtype
tensor([3., 4.])
# Norm
tensor(5.)
```

Note that when computing a norm the vector must have a floating-point type, not an integer type; otherwise an error is raised:

```python
x = torch.tensor([3,4])  # dtype is torch.LongTensor
torch.norm(x)
```
```python
# error
RuntimeError: norm(): input dtype should be either floating point or complex. Got Long instead.
```

Automatic Differentiation

torch.Tensor.backward | pytorch docs

Autograd mechanics | pytorch docs

Mu Li - 07 Automatic differentiation (动手学深度学习v2)

One of PyTorch's core features is automatic differentiation through backward.

  • Computation graph

PyTorch's automatic differentiation is implemented with a computation graph: a function $z$ and its variables form a directed acyclic graph whose nodes are operations (+, -, *, sum, mean, ...). PyTorch records these operations and then applies the chain rule for composite functions to compute the corresponding gradients.

(figure: computation graph)
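
The operations recorded in the graph can be inspected through the grad_fn attribute of non-leaf tensors; a minimal sketch (values chosen only for illustration):

```python
import torch

x = torch.tensor([1.0, 2.0], requires_grad=True)
y = x * 3          # a multiplication node is recorded
z = y.sum()        # a sum node is recorded

print(y.grad_fn)   # <MulBackward0 ...>
print(z.grad_fn)   # <SumBackward0 ...>
z.backward()       # walks the graph backwards using the chain rule
print(x.grad)      # tensor([3., 3.])
```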

  • requires_grad

PyTorch tensors have an extra attribute, requires_grad, which defaults to False. Setting it to True marks the tensor as a variable whose gradient should be computed.

```python
x = torch.arange(3, requires_grad=False, dtype=float)
y = torch.dot(x,x)  # inner product
y.backward()
# with requires_grad=False, automatic differentiation fails:
# RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
```

Note: only floating-point tensors can take part in differentiation, which is why dtype=float is used when defining tensors here (and below).

  • detach

requires_grad_() toggles a tensor's requires_grad flag in place. The related detach() method returns a new tensor that shares the same data but is cut off from the computation graph (its requires_grad is False), so no gradient flows through it.
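
A minimal sketch of the two calls (values are illustrative):

```python
import torch

x = torch.ones(3)
x.requires_grad_()          # switch gradient tracking on, in place
print(x.requires_grad)      # True

d = x.detach()              # same data, but outside the computation graph
print(d.requires_grad)      # False

y = (x * 2).sum()
y.backward()
print(x.grad)               # tensor([2., 2., 2.])
```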

  • grad

A tensor's grad attribute stores the gradient of that variable; the gradient from every backward call is accumulated into grad.

```python
x = torch.arange(3, requires_grad=True, dtype=float)
print(x.grad)  # None
y = torch.dot(x,x)
y.backward()
print(x.grad)  # tensor([0., 2., 4.], dtype=torch.float64)
z = torch.sum(x)
z.backward()
print(x.grad)  # tensor([1., 3., 5.], dtype=torch.float64)
```

Before any backward() call, x.grad is None.

After the first backward call, x.grad becomes $\frac{\partial y}{\partial x} = [0,2,4]$.

After the second backward call, $\frac{\partial z}{\partial x} = [1,1,1]$ is added on top of the previous $[0,2,4]$, giving $[1,3,5]$.

  • grad.zero_()

Because PyTorch accumulates grad, the gradient is usually cleared with grad.zero_() before a new backward pass.

The code below adds an x.grad.zero_() call to the grad example above.

```python
...
print(x.grad)  # tensor([0., 2., 4.], dtype=torch.float64)
x.grad.zero_()
print(x.grad)  # tensor([0., 0., 0.], dtype=torch.float64)
z = torch.sum(x)
z.backward()
print(x.grad)  # tensor([1., 1., 1.], dtype=torch.float64)
```

  • retain_grad()

For non-leaf tensors in the computation graph (intermediate variables), the gradient must be kept with .retain_grad(); otherwise it is freed once backpropagation finishes.

$$\begin{gather}
z = y^2,\qquad y = \bold{x}^T\bold{x} = x_1^2+x_2^2+x_3^2\\
\frac{\partial z}{\partial \bold{x}} = \frac{\partial z}{\partial y} \frac{\partial y}{\partial \bold{x}} = 2y[2x_1,2x_2,2x_3] = 4y[x_1,x_2,x_3]
\end{gather}$$

```python
x = torch.arange(3, requires_grad=True, dtype=float)
y = torch.dot(x,x)
print(y.requires_grad)  # True
z = y**2
print(z.requires_grad)  # True

z.backward()
print(y.grad)  # None
print(x.grad)  # tensor([ 0., 20., 40.], dtype=torch.float64)
```

In this example y is an intermediate variable: its requires_grad is True, yet after z's backward its grad is None, and the interpreter emits a warning:

```text
UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more informations.
```

Adding a y.retain_grad() call to the code above keeps the gradient of the intermediate variable y; after that y.grad is available:

```python
......
y.retain_grad()
z.backward()
print(y.grad)  # tensor(10., dtype=torch.float64)
print(x.grad)  # tensor([ 0., 20., 40.], dtype=torch.float64)
```

dim

PyTorch dims are counted from 0. A $2\times3\times4$ tensor looks like this:

$$\begin{split}
\rm tensor (
\textcolor{red}{\overset{0}{\boldsymbol{[}}}
\textcolor{blue}{\overset{1}{\boldsymbol{[}}}
&\textcolor{green}{\overset{2}{\boldsymbol{[}}} a_{000},a_{001},a_{002},a_{003}\textcolor{green}{\boldsymbol{]}},\\
&\textcolor{green}{\boldsymbol{[}}a_{010},a_{011},a_{012},a_{013}\textcolor{green}{\boldsymbol{]}},\\
&\textcolor{green}{\boldsymbol{[}}a_{020},a_{021},a_{022},a_{023}\textcolor{green}{\boldsymbol{]}} \textcolor{blue}{\boldsymbol{]}},\\\\
\textcolor{blue}{\boldsymbol{[}}
&\textcolor{green}{\boldsymbol{[}}a_{100},a_{101},a_{102},a_{103}\textcolor{green}{\boldsymbol{]}},\\
&\textcolor{green}{\boldsymbol{[}}a_{110},a_{111},a_{112},a_{113}\textcolor{green}{\boldsymbol{]}},\\
&\textcolor{green}{\boldsymbol{[}}a_{120},a_{121},a_{122},a_{123}\textcolor{green}{\boldsymbol{]}} \textcolor{blue}{\boldsymbol{]}}
\textcolor{red}{\boldsymbol{]}} )
\end{split}$$

Summing a 3-dimensional $d_0\times d_1\times d_2$ tensor with $\rm dim\in\{0,1,2\}$ gives, respectively,

$$\sum_{i\in d_0} a_{ijk} \qquad \sum_{j\in d_1} a_{ijk} \qquad \sum_{k\in d_2} a_{ijk}$$

For example, summing torch.ones(2,3,4) with dim=0, 1, 2:

```python
arr = torch.ones(2,3,4)
```
```python
# arr
tensor([[[1., 1., 1., 1.],
         [1., 1., 1., 1.],
         [1., 1., 1., 1.]],

        [[1., 1., 1., 1.],
         [1., 1., 1., 1.],
         [1., 1., 1., 1.]]])
```
```python
# torch.sum(arr, dim=0)
tensor([[2., 2., 2., 2.],
        [2., 2., 2., 2.],
        [2., 2., 2., 2.]])
```
```python
# torch.sum(arr, dim=1)
tensor([[3., 3., 3., 3.],
        [3., 3., 3., 3.]])
```
```python
# torch.sum(arr, dim=2)
tensor([[4., 4., 4.],
        [4., 4., 4.]])
```

Logistic Regression

In classification we are not comparing numeric magnitudes. For the MNIST handwritten-digit dataset, for example, what we want is not some "size" of the input image but the probability that it belongs to each of the 10 digit classes:

$$Pr(x\in 0),\ Pr(x\in1),\ \cdots,\ Pr(x\in9)$$

Logistic Function

$$\sigma(x) = \frac{1}{1+e^{-x}},\quad x\in \R$$

Some other sigmoid functions:

$${\rm erf}(\frac{\sqrt{\pi}}{2}x),\quad \frac{x}{\sqrt{1+x^2}},\quad \tanh(x)$$

$$\frac{2}{\pi}\arctan(\frac{\pi}{2}x),\quad \frac{2}{\pi}{\rm gd}(\frac{\pi}{2}x),\quad \frac{x}{1+|x|}$$
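
As a quick sanity check of the logistic function (the other sigmoids above are listed only for reference), torch.sigmoid evaluates $\sigma(x)$ directly:

```python
import torch

x = torch.tensor([-2.0, 0.0, 2.0])
print(torch.sigmoid(x))           # tensor([0.1192, 0.5000, 0.8808])
print(1 / (1 + torch.exp(-x)))    # same values, computed from the definition
print(torch.tanh(x))              # tanh, another sigmoid-shaped function
```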

Softmax

Let the output of the last layer of the network be $\bold{z} = [z_1,z_2,\cdots,z_n]\in\R^n$. Its softmax is

$$\begin{bmatrix}
\frac{e^{z_1}}{e^{z_1}+\cdots+e^{z_n}},
\frac{e^{z_2}}{e^{z_1}+\cdots+e^{z_n}},
\cdots,
\frac{e^{z_n}}{e^{z_1}+\cdots+e^{z_n}}
\end{bmatrix}$$

The $i$-th softmax value $\frac{e^{z_i}}{e^{z_1}+\cdots+e^{z_n}}$ is the probability that the input sample belongs to class $i$.

$$\begin{bmatrix} z_1\\z_2\\\vdots\\z_n \end{bmatrix}
\overset{\rm Softmax}{\longrightarrow}
\begin{bmatrix} \frac{e^{z_1}}{\sum e^{z_i}}\\\frac{e^{z_2}}{\sum e^{z_i}}\\\vdots\\\frac{e^{z_n}}{\sum e^{z_i}} \end{bmatrix}
\overset{\rm CrossEntropy}{\longrightarrow}
\sum \begin{pmatrix}
-y_1\cdot\log(\frac{e^{z_1}}{\sum e^{z_i}})\\
-y_2\cdot\log(\frac{e^{z_2}}{\sum e^{z_i}})\\
\vdots\\
-y_n\cdot\log(\frac{e^{z_n}}{\sum e^{z_i}})
\end{pmatrix}
\overset{\rm one-hot}{=} -\log(\frac{e^{z_j}}{\sum e^{z_i}})$$

Here $\bold{y}=[y_1,y_2,\cdots,y_n]$ is the one-hot encoding of the class label: $[1,0,\cdots,0]$ is the label of class 1, $[0,1,\cdots,0]$ of class 2, and $[0,0,\cdots,1]$ of class n. Since a one-hot vector has a single 1 and zeros everywhere else, only one term of the sum survives; all other terms are 0.
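
A small check of this pipeline in PyTorch (the logits and label below are made up): note that torch.nn.functional.cross_entropy combines log-softmax and negative log-likelihood internally, so it is applied to the raw logits $\bold{z}$, not to the softmax output.

```python
import torch
import torch.nn.functional as F

z = torch.tensor([[2.0, 0.5, -1.0]])   # logits for one sample, 3 classes
label = torch.tensor([0])              # class index (position of the 1 in the one-hot vector)

# manual computation: -log(softmax(z)[label])
p = torch.softmax(z, dim=1)
manual = -torch.log(p[0, label])

# built-in cross entropy on the raw logits
builtin = F.cross_entropy(z, label)

print(manual.item(), builtin.item())   # the two values agree
```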

Derivative of softmax

Let the output of the network be

$$\bold{z} = [z_1,z_2,\cdots,z_n]$$

The softmax output is

$$\bold{\hat{y}} = [\hat{y}_1,\hat{y}_2,\cdots,\hat{y}_n] = {\rm softmax}(\bold{z}) = [\frac{e^{z_1}}{\sum e^{z_i}},\frac{e^{z_2}}{\sum e^{z_i}},\cdots,\frac{e^{z_n}}{\sum e^{z_i}}]$$

The label of the $j$-th class is

$$\bold{y}^{(j)} = [\underset{1}{0},\cdots,\underset{j-1}{0},\underset{j}{1},\underset{j+1}{0},\cdots,\underset{n}{0}]$$

Computing the cross entropy of $\bold{y}$ and $\bold{\hat{y}}$ gives

$$\begin{split}
{\rm CrossEntropy}(\bold{y},\bold{\hat{y}}) &= -\sum_{j=1}^n y_j\log(\frac{e^{z_j}}{\sum_{i=1}^n e^{z_i}})\\
&= \sum_{j=1}^n\left(y_j\left[\log(\sum_{i=1}^n e^{z_i}) - \log(e^{z_j})\right]\right)\\
&= \sum_{j=1}^ny_j\log(\sum_{i=1}^n e^{z_i}) - \sum_{j=1}^ny_jz_j\\
(\textstyle\sum y_i = 1)\rightarrow&= \log(\sum_{i=1}^n e^{z_i})\cdot\sum_{j=1}^ny_j - \sum_{j=1}^ny_jz_j\\
&= \log(\sum_{i=1}^n e^{z_i}) - \sum_{j=1}^ny_jz_j
\end{split}$$

In particular, for the one-hot label $\bold{y}^{(j)}$,

$${\rm CrossEntropy}(\bold{y}^{(j)},\bold{\hat{y}}) = \log(\sum_{i=1}^n e^{z_i}) - z_j$$

Method 1 for computing the derivative

$$\begin{split}
{\rm CrossEntropy}(\bold{y},\bold{\hat{y}}) &= {\rm CrossEntropy}\big(\bold{y},{\rm softmax}(\bold{z})\big)\\
\text{($\bold{y}$, $\bold{z}$ are row vectors)}\rightarrow&= -\bold{y}\log\left[{\rm softmax}(\bold{z}^T)\right]\\
\text{($(\cdot,\cdot)$ is the inner product)}\rightarrow&= -\Big(\bold{y},\log\left[{\rm softmax}(\bold{z})\right]\Big)
\end{split}$$

Therefore

$$\begin{split}
\frac{\partial\ {\rm CrossEntropy}(\bold{y},\bold{\hat{y}})}{\partial \bold{z}}
&=
-\frac{\partial\Big(\bold{y},\log\left[{\rm softmax}(\bold{z})\right]\Big)}{\partial\bold{z}}\\
&=
-\frac{\partial\Big(\bold{y},\log(\bold{\hat{y}})\Big)}{\partial\log(\bold{\hat{y}})}
\frac{\partial\log(\bold{\hat{y}}^T)}{\partial\bold{\hat{y}}}
\frac{\partial\bold{\hat{y}}^T}{\partial\bold{z}}\\
&=
-\bold{y}
\begin{bmatrix}
\frac{1}{\hat{y}_1}&0&\cdots&0\\
0&\frac{1}{\hat{y}_2}&\cdots&0\\
\vdots&\vdots&\ddots&\vdots\\
0&0&\cdots&\frac{1}{\hat{y}_n}\\
\end{bmatrix}
\frac{\partial\bold{\hat{y}}^T}{\partial\bold{z}}\\
&=
-\bold{y}
\left(E-
\begin{bmatrix}
\hat{y}_1&\hat{y}_2&\cdots&\hat{y}_n\\
\hat{y}_1&\hat{y}_2&\cdots&\hat{y}_n\\
\vdots&\vdots&\ddots&\vdots\\
\hat{y}_1&\hat{y}_2&\cdots&\hat{y}_n\\
\end{bmatrix}
\right)\\
&=
\bold{y}
\begin{bmatrix}
\hat{y}_1&\hat{y}_2&\cdots&\hat{y}_n\\
\hat{y}_1&\hat{y}_2&\cdots&\hat{y}_n\\
\vdots&\vdots&\ddots&\vdots\\
\hat{y}_1&\hat{y}_2&\cdots&\hat{y}_n\\
\end{bmatrix}
-
\bold{y}\\
&=
\begin{bmatrix}
\hat{y}_1\sum y_i&\hat{y}_2\sum y_i&\cdots&\hat{y}_n\sum y_i
\end{bmatrix}
-\bold{y}\\
(\textstyle\sum y_i = 1)\rightarrow&=
\begin{bmatrix}
\hat{y}_1&\hat{y}_2&\cdots&\hat{y}_n
\end{bmatrix}
-\bold{y}\\
&=
{\rm softmax}(\bold{z})-\bold{y}
\end{split}$$

where

$$\begin{split}
\frac{\partial\bold{\hat{y}}^T}{\partial \bold{z}}
=\frac{\partial\ {\rm softmax}(\bold{z}^T)}{\partial \bold{z}}
&=
\frac{\partial \begin{bmatrix}
\frac{e^{z_1}}{e^{z_1}+\cdots+e^{z_n}},
\frac{e^{z_2}}{e^{z_1}+\cdots+e^{z_n}},
\cdots,
\frac{e^{z_n}}{e^{z_1}+\cdots+e^{z_n}}
\end{bmatrix}^T}
{\partial [z_1,z_2,\cdots,z_n]}\\
&=
\begin{bmatrix}
\hat{y}_1(1-\hat{y}_1) & \hat{y}_2(-\hat{y}_1) & \cdots & \hat{y}_n(-\hat{y}_1)\\
\hat{y}_1(-\hat{y}_2) & \hat{y}_2(1-\hat{y}_2) & \cdots & \hat{y}_n(-\hat{y}_2)\\
\vdots&\vdots&\ddots&\vdots\\
\hat{y}_1(-\hat{y}_n) & \hat{y}_2(-\hat{y}_n) & \cdots & \hat{y}_n(1-\hat{y}_n)\\
\end{bmatrix}\\
&=
\begin{bmatrix}
\hat{y}_1&0&\cdots&0\\
0&\hat{y}_2&\cdots&0\\
\vdots&\vdots&\ddots&\vdots\\
0&0&\cdots&\hat{y}_n\\
\end{bmatrix}
\left(E-
\begin{bmatrix}
\hat{y}_1&\hat{y}_2&\cdots&\hat{y}_n\\
\hat{y}_1&\hat{y}_2&\cdots&\hat{y}_n\\
\vdots&\vdots&\ddots&\vdots\\
\hat{y}_1&\hat{y}_2&\cdots&\hat{y}_n\\
\end{bmatrix}
\right)
\end{split}$$

Method 2 for computing the derivative

Take the derivative of the cross entropy with respect to $\bold{z}$:

$$\begin{split}
\frac{\partial\ {\rm CE}(\bold{y},\bold{\hat{y}})}{\partial \bold{z}}
&=
\begin{bmatrix}
\frac{\partial\ {\rm CE}}{\partial z_1},
\frac{\partial\ {\rm CE}}{\partial z_2},
\cdots,
\frac{\partial\ {\rm CE}}{\partial z_n}
\end{bmatrix}\\
&=
\begin{bmatrix}
{\rm smx}(z_1)-y_1,&{\rm smx}(z_2)-y_2,&\cdots,&{\rm smx}(z_n)-y_n
\end{bmatrix}\\
&=
{\rm softmax}([z_1,z_2,\cdots,z_n])-[y_1,y_2,\cdots,y_n]\\
&=
{\rm softmax}(\bold{z})-\bold{y}
\end{split}$$

where

$${\rm CrossEntropy}(\bold{y},\bold{\hat{y}}) = \log(\sum_{i=1}^n e^{z_i}) - \sum_{j=1}^ny_jz_j$$

$$\begin{split}
\frac{\partial\ {\rm CE}(\bold{y},\bold{\hat{y}})}{\partial z_i}
&=
\frac{\partial}{\partial z_i}\log(\sum_{i=1}^n e^{z_i}) - \frac{\partial}{\partial z_i}\sum_{j=1}^ny_jz_j\\
&= \frac{e^{z_i}}{\sum_{k=1}^ne^{z_k}} - y_i\\
&= {\rm softmax}(\bold{z})_i - y_i
\end{split}$$
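
The result $\frac{\partial\, {\rm CE}}{\partial \bold{z}} = {\rm softmax}(\bold{z})-\bold{y}$ is easy to verify numerically with autograd (a small sketch with made-up values):

```python
import torch

# check d CE / d z = softmax(z) - y using autograd
z = torch.tensor([1.0, 2.0, 0.5], requires_grad=True)
y = torch.tensor([0.0, 1.0, 0.0])          # one-hot label for class 2

ce = -(y * torch.log_softmax(z, dim=0)).sum()
ce.backward()

print(z.grad)                       # gradient computed by autograd
print(torch.softmax(z, dim=0) - y)  # closed-form softmax(z) - y; matches z.grad
```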

draft

$${\rm softmax}(\bold{z}) = [\frac{e^{z_1}}{\sum e^{z_i}},\frac{e^{z_2}}{\sum e^{z_i}},\cdots,\frac{e^{z_n}}{\sum e^{z_i}}]$$

$$\begin{split}
{\rm CrossEntropy}(\bold{y},\bold{\hat{y}}) &= {\rm CrossEntropy}\big(\bold{y},{\rm softmax}(\bold{z})\big)\\
\text{($\bold{y}$, $\bold{z}$ are row vectors)}\rightarrow&= -\bold{y}\log\left[{\rm softmax}(\bold{z}^T)\right]\\
\text{($(\cdot,\cdot)$ is the inner product)}\rightarrow&= -\Big(\bold{y},\log\left[{\rm softmax}(\bold{z})\right]\Big)
\end{split}$$

$$\log \circ\, {\rm softmax}(\bold{z}) = \bold{z}-\log(\textstyle\sum e^{\bold{z}})$$

PolyLoss

draft

$$\log(x+1) = x - \frac{x^2}{2} + \frac{x^3}{3}+\cdots+(-1)^{n+1}\frac{x^n}{n}+O(x^{n+1})$$

$$\frac{\log(x+1)}{x} \sim 1 \quad (x \rightarrow 0)$$

$$\log(x) = \log((x-1)+1) = (x-1) - \frac{(x-1)^2}{2} + \frac{(x-1)^3}{3}+\cdots+(-1)^{n+1}\frac{(x-1)^n}{n}+O((x-1)^{n+1})$$

$$\frac{\log(x)}{x-1} \sim 1 \quad(x\rightarrow 1)$$

$$\log(x) = (x-1) + O((x-1)^2)$$
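
These expansions are the starting point of PolyLoss, which rewrites the cross entropy as the series $-\log(P_t)=\sum_{k\ge1}\frac{1}{k}(1-P_t)^k$ and perturbs the leading coefficients. Below is a rough sketch of the Poly-1 variant as I understand it; the function name and the default epsilon are illustrative assumptions, not part of any library API.

```python
import torch
import torch.nn.functional as F

def poly1_cross_entropy(logits, target, epsilon=1.0):
    # Sketch of Poly-1: cross entropy plus epsilon * (1 - P_t),
    # where P_t is the predicted probability of the true class.
    # epsilon=1.0 is only a placeholder default, not a recommended value.
    ce = F.cross_entropy(logits, target, reduction='none')
    pt = torch.softmax(logits, dim=1).gather(1, target.unsqueeze(1)).squeeze(1)
    return (ce + epsilon * (1.0 - pt)).mean()

logits = torch.randn(4, 10)            # 4 samples, 10 classes
target = torch.randint(0, 10, (4,))
print(poly1_cross_entropy(logits, target))
```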

Fully Connected Network (FCN)

FCN on MNIST

```python
import torch
from torchvision import transforms        # image preprocessing utilities
from torchvision import datasets
from torch.utils.data import DataLoader
import torch.nn.functional as F           # provides the relu function
import torch.optim as optim               # optimizers

batch_size = 64
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
])

train_dataset = datasets.MNIST(root='../dataset/mnist/',
                               train=True,
                               download=False,  # set to True on the first run to download the dataset
                               transform=transform)
test_dataset = datasets.MNIST(root='../dataset/mnist/',
                              train=False,
                              download=False,  # set to True on the first run to download the dataset
                              transform=transform)

train_loader = DataLoader(train_dataset,
                          shuffle=True,
                          batch_size=batch_size)
test_loader = DataLoader(test_dataset,
                         shuffle=False,
                         batch_size=batch_size)

class Net(torch.nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.l1 = torch.nn.Linear(784, 512)
        self.l2 = torch.nn.Linear(512, 256)
        self.l3 = torch.nn.Linear(256, 128)
        self.l4 = torch.nn.Linear(128, 64)
        self.l5 = torch.nn.Linear(64, 10)

    def forward(self, x):
        x = x.view(-1, 784)
        x = F.relu(self.l1(x))
        x = F.relu(self.l2(x))
        x = F.relu(self.l3(x))
        x = F.relu(self.l4(x))
        return self.l5(x)

model = Net()

criterion = torch.nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.5)

def train(epoch):
    running_loss = 0.0
    for batch_idx, data in enumerate(train_loader, 0):
        inputs, target = data
        optimizer.zero_grad()

        # forward + backward + update
        outputs = model(inputs)
        loss = criterion(outputs, target)
        loss.backward()
        optimizer.step()

        running_loss += loss.item()
        if batch_idx % 300 == 299:
            print('[%d, %5d] loss: %.3f' % (epoch + 1, batch_idx + 1, running_loss / 300))
            running_loss = 0.0

def test():
    correct = 0
    total = 0
    with torch.no_grad():
        for data in test_loader:
            images, labels = data
            outputs = model(images)
            _, predicted = torch.max(outputs.data, dim=1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

    print('Accuracy on test set: %d %%' % (100 * correct / total))

if __name__ == '__main__':
    for epoch in range(10):
        train(epoch)
        test()
```

Convolutional Neural Network (CNN)

Computing the input/output size

$$Out = (In - Kernel + 2\cdot Padding)/Stride + 1$$
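
For example, with $In=100$, $Kernel=3$, $Padding=0$, $Stride=1$ this gives $(100-3+0)/1+1 = 98$, matching Demo 1 below. A tiny helper (the name conv_out_size is made up here) to evaluate the formula:

```python
def conv_out_size(in_size, kernel, padding=0, stride=1):
    # output spatial size of a convolution: floor((In - Kernel + 2*Padding) / Stride) + 1
    return (in_size - kernel + 2 * padding) // stride + 1

print(conv_out_size(100, 3))            # 98, as in Demo 1
print(conv_out_size(5, 3, padding=1))   # 5, as in Demo 2
print(conv_out_size(28, 5))             # 24, first conv layer of the CNN in Demo 4
```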

Demo 1

```python
import torch
in_channels, out_channels = 5, 10
width, height = 100, 100
kernel_size = 3
batch_size = 1

input = torch.randn(batch_size,
                    in_channels,
                    width,
                    height)

conv_layer = torch.nn.Conv2d(in_channels,
                             out_channels,
                             kernel_size=kernel_size)

output = conv_layer(input)

print(input.shape)              # torch.Size([1, 5, 100, 100])
print(output.shape)             # torch.Size([1, 10, 98, 98])
print(conv_layer.weight.shape)  # torch.Size([10, 5, 3, 3])
```

Demo 2: padding

```python
import torch

# convolution layer
conv_layer = torch.nn.Conv2d(1, 1, kernel_size=3, padding=1, bias=False)
# set the kernel weights
kernel = torch.arange(1, 10, dtype=torch.float32).view(1,1,3,3)
conv_layer.weight.data = kernel.data

# input + convolution output
input = torch.ones(1, 1, 5, 5, dtype=torch.float32)
output = conv_layer(input)

print(output.size())  # torch.Size([1, 1, 5, 5])
print(output)
# tensor([[[[28., 39., 39., 39., 24.],
#           [33., 45., 45., 45., 27.],
#           [33., 45., 45., 45., 27.],
#           [33., 45., 45., 45., 27.],
#           [16., 21., 21., 21., 12.]]]], grad_fn=<ConvolutionBackward0>)
```

Demo 3: MaxPool

```python
import torch

maxpooling_layer = torch.nn.MaxPool2d(kernel_size=2)

# input + pooling output
input = torch.ones(1, 1, 5, 5, dtype=torch.float32)
output = maxpooling_layer(input)

print(output.size())  # torch.Size([1, 1, 2, 2])
print(output)
# tensor([[[[1., 1.],
#           [1., 1.]]]])
```

Demo 4: CNN

Network structure

$$\begin{array}{rcl}
\text{Input Layer} & \fbox{input}\\
&\downarrow & (batch,1,28,28)\\
\text{Conv2d Layer1} & \fbox{$C_{in}=1,C_{out}=10,kernel=5$}\\
&\downarrow & (batch,10,24,24)\\
\text{ReLU Layer1} & \fbox{ReLU Layer}\\
&\downarrow & (batch,10,24,24)\\
\text{Pooling Layer1} & \fbox{$kernel=2\times2$}\\
&\downarrow & (batch,10,12,12)\\
\text{Conv2d Layer2} & \fbox{$C_{in}=10,C_{out}=20,kernel=5$}\\
&\downarrow & (batch,20,8,8)\\
\text{ReLU Layer2} & \fbox{ReLU Layer}\\
&\downarrow & (batch,20,8,8)\\
\text{Pooling Layer2} & \fbox{$kernel=2\times2$}\\
&\downarrow & (batch,20,4,4) \rightarrow (batch,320)\\
\text{Linear Layer} & \fbox{$f_{in}=320,f_{out}=10$}\\
&\downarrow & (batch,10)\\
\text{Output Layer} & \fbox{Output}
\end{array}$$

Model code

```python
import torch
import torch.nn.functional as F  # provides the relu function

class Net(torch.nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = torch.nn.Conv2d(1, 10, kernel_size=5)
        self.conv2 = torch.nn.Conv2d(10, 20, kernel_size=5)
        self.pooling = torch.nn.MaxPool2d(2)
        self.fc = torch.nn.Linear(320, 10)

    def forward(self, x):
        batch_size = x.size(0)
        x = self.pooling(F.relu(self.conv1(x)))
        x = self.pooling(F.relu(self.conv2(x)))
        x = x.view(batch_size, -1)
        x = self.fc(x)
        return x
```

Replacing the Net class in the earlier FCN code with the Net above is all that is needed to run it.

CNN on MNIST

```python
import torch
from torchvision import transforms        # image preprocessing utilities
from torchvision import datasets
from torch.utils.data import DataLoader
import torch.nn.functional as F           # provides the relu function
import torch.optim as optim               # optimizers

batch_size = 64
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
])

train_dataset = datasets.MNIST(root='../dataset/mnist/',
                               train=True,
                               download=False,  # set to True on the first run to download the dataset
                               transform=transform)
test_dataset = datasets.MNIST(root='../dataset/mnist/',
                              train=False,
                              download=False,  # set to True on the first run to download the dataset
                              transform=transform)

train_loader = DataLoader(train_dataset,
                          shuffle=True,
                          batch_size=batch_size)
test_loader = DataLoader(test_dataset,
                         shuffle=False,
                         batch_size=batch_size)

class Net(torch.nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = torch.nn.Conv2d(1, 10, kernel_size=5)
        self.conv2 = torch.nn.Conv2d(10, 20, kernel_size=5)
        self.pooling = torch.nn.MaxPool2d(2)
        self.fc = torch.nn.Linear(320, 10)

    def forward(self, x):
        batch_size = x.size(0)
        x = self.pooling(F.relu(self.conv1(x)))
        x = self.pooling(F.relu(self.conv2(x)))
        x = x.view(batch_size, -1)
        x = self.fc(x)
        return x

model = Net()

criterion = torch.nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.5)

def train(epoch):
    running_loss = 0.0
    for batch_idx, data in enumerate(train_loader, 0):
        inputs, target = data
        optimizer.zero_grad()

        # forward + backward + update
        outputs = model(inputs)
        loss = criterion(outputs, target)
        loss.backward()
        optimizer.step()

        running_loss += loss.item()
        if batch_idx % 300 == 299:
            print('[%d, %5d] loss: %.3f' % (epoch + 1, batch_idx + 1, running_loss / 300))
            running_loss = 0.0

def test():
    correct = 0
    total = 0
    with torch.no_grad():
        for data in test_loader:
            images, labels = data
            outputs = model(images)
            _, predicted = torch.max(outputs.data, dim=1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

    print('Accuracy on test set: %d %%' % (100 * correct / total))

if __name__ == '__main__':
    for epoch in range(10):
        train(epoch)
        test()
```