Introduction to the MNIST Handwritten Digit Dataset

Reference: Reading the MNIST handwritten digit dataset with Python

  • The MNIST file train-images.idx3-ubyte contains 60000 images of $28\times28$ pixels; we flatten each image into a $784\times1$ vector

  • The MNIST file train-labels.idx1-ubyte contains 60000 integers in $[0,9]$; we convert each into a $10\times1$ one-hot vector, as sketched below
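
As a quick sanity check, both conversions take only a few lines of NumPy. This is a minimal sketch; `img` and `y` are stand-ins for one raw image and its label:

```python
import numpy as np

img = np.random.randint(0, 256, size=(28, 28))  # stand-in for one 28x28 MNIST image
y = 7                                           # stand-in for its integer label

x = img.reshape(784, 1)   # flatten the 28x28 image into a 784x1 column vector
t = np.zeros((10, 1))     # build the 10x1 one-hot target
t[y, 0] = 1

print(x.shape, t.shape)   # (784, 1) (10, 1)
```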

Network Architecture

We use a two-layer feedforward network: 784 nodes in the input layer, 30 nodes in the hidden layer, and 10 nodes in the output layer.

(Figure: structure of the MNIST network)
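
With these sizes fixed, it is worth counting the trainable parameters once; a small sketch (the `sizes` list mirrors the network above):

```python
# hidden layer: 30*784 weights + 30 biases = 23550
# output layer: 10*30 weights + 10 biases  = 310
sizes = [784, 30, 10]
n_params = sum(n_out * n_in + n_out for n_in, n_out in zip(sizes[:-1], sizes[1:]))
print(n_params)  # 23860 trainable parameters
```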

Forward Propagation

Input layer

$$o_i^I = x_i^I,\qquad i=1,2,\cdots,784$$

Hidden layer

$$\begin{split} x_j^J &= \sum_{i=1}^{784} w_{ji}^J o_i^I + b_j^J\\ o_j^J &= \sigma(x_j^J) \end{split}\,,\qquad j=1,2,\cdots,30$$

$$\begin{bmatrix} o_1^J\\ o_2^J \\ \vdots \\ o_{30}^J \end{bmatrix} = \sigma\left( \begin{bmatrix} w_{11}^J & w_{12}^J & \cdots & w_{1,784}^J\\ w_{21}^J & w_{22}^J & \cdots & w_{2,784}^J\\ \vdots & \vdots & \ddots & \vdots \\ w_{30,1}^J & w_{30,2}^J & \cdots & w_{30,784}^J \end{bmatrix} \begin{bmatrix} o_1^I\\ o_2^I \\ \vdots \\ o_{784}^I \end{bmatrix} + \begin{bmatrix} b_1^J\\ b_2^J \\ \vdots \\ b_{30}^J \end{bmatrix} \right)$$

Output layer

$$\begin{split} x_k^K &= \sum_{j=1}^{30} w_{kj}^K o_j^J + b_k^K \\ o_k^K &= \sigma(x_k^K) \end{split}\,,\qquad k=1,2,\cdots,10$$

$$\begin{bmatrix} o_1^K\\ o_2^K \\ \vdots \\ o_{10}^K \end{bmatrix} = \sigma\left( \begin{bmatrix} w_{11}^K & w_{12}^K & \cdots & w_{1,30}^K\\ w_{21}^K & w_{22}^K & \cdots & w_{2,30}^K\\ \vdots & \vdots & \ddots & \vdots \\ w_{10,1}^K & w_{10,2}^K & \cdots & w_{10,30}^K \end{bmatrix} \begin{bmatrix} o_1^J\\ o_2^J \\ \vdots \\ o_{30}^J \end{bmatrix} + \begin{bmatrix} b_1^K\\ b_2^K \\ \vdots \\ b_{10}^K \end{bmatrix} \right)$$

Overall

$$o_k^K = \sigma\bigg(\sum_{j=1}^{30} w_{kj}^K\,\sigma\Big(\sum_{i=1}^{784} w_{ji}^J o_i^I + b_j^J\Big) + b_k^K\bigg),\qquad k=1,2,\cdots,10$$

$$\begin{split} \boldsymbol{o}^K &= \sigma(W^K\boldsymbol{o}^J+\boldsymbol{b}^K)\\ &=\sigma\Big(W^K\big[\sigma(W^J\boldsymbol{x}^I+\boldsymbol{b}^J)\big]+\boldsymbol{b}^K\Big) \end{split}$$
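
In code, the whole forward pass is just two affine maps with a sigmoid after each. A minimal NumPy sketch with randomly initialized parameters (the names `x_I`, `W_J`, `b_J`, `W_K`, `b_K` are illustrative; later snippets reuse these variables):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x_I = np.random.rand(784, 1)                                   # input column vector
W_J, b_J = np.random.randn(30, 784), np.random.randn(30, 1)    # hidden-layer parameters
W_K, b_K = np.random.randn(10, 30), np.random.randn(10, 1)     # output-layer parameters

o_J = sigmoid(W_J @ x_I + b_J)   # hidden activations, shape (30, 1)
o_K = sigmoid(W_K @ o_J + b_K)   # output activations, shape (10, 1)
print(o_K.shape)                 # (10, 1)
```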

Error Function

For a single sample $([x_1,\cdots,x_{784}],[t_1,\cdots,t_{10}])$ with network output $[o_1,\cdots,o_{10}]$, we measure the error with the sum of squares:

$$E = \frac{1}{2}\sum_{k=1}^{10}(o_k^K - t_k)^2$$
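
Continuing the sketch above, the error is one line given the output `o_K` and a one-hot target `t`:

```python
t = np.zeros((10, 1))
t[3, 0] = 1                        # one-hot target for the digit 3
E = 0.5 * np.sum((o_K - t) ** 2)   # sum-of-squares error, a scalar
```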

Backpropagation

Output-layer gradients

  • Gradient of the output-layer weight matrix

$$\begin{gather*} \begin{split} \frac{\partial E}{\partial w_{kj}^K} &= \frac{\partial E}{\partial o_k^K}\cdot \frac{\partial o_k^K}{\partial x_k^K}\cdot \frac{\partial x_k^K}{\partial w_{kj}^K}\\ &= (o_k^K - t_k) \cdot o_k^K(1-o_k^K) \cdot o_j^J \end{split} \\ k=1,2,\cdots,10\qquad j=1,2,\cdots,30 \end{gather*}$$

  • Gradient of the output-layer bias vector

$$\begin{gather*} \begin{split} \frac{\partial E}{\partial b_{k}^K} &= \frac{\partial E}{\partial o_k^K}\cdot \frac{\partial o_k^K}{\partial x_k^K}\cdot \frac{\partial x_k^K}{\partial b_{k}^K}\\ &= (o_k^K - t_k) \cdot o_k^K(1-o_k^K) \cdot 1 \end{split} \\ k=1,2,\cdots,10 \end{gather*}$$
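
These two formulas share the common factor $(o_k^K - t_k)\,o_k^K(1-o_k^K)$; continuing the sketch above, the output-layer gradients are an outer product and the error term itself:

```python
delta_K = (o_K - t) * o_K * (1 - o_K)   # output-layer error term, shape (10, 1)
grad_W_K = delta_K @ o_J.T              # dE/dW^K, shape (10, 30)
grad_b_K = delta_K                      # dE/db^K, shape (10, 1)
```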

Hidden-layer gradients

  • Gradient of the hidden-layer weight matrix

$$\begin{gather*} \begin{split} \frac{\partial E}{\partial w_{ji}^J} &= \sum_{k=1}^{10} \frac{\partial E}{\partial o_k^K}\cdot \frac{\partial o_k^K}{\partial x_k^K}\cdot \frac{\partial x_k^K}{\partial o_j^J}\cdot \frac{\partial o_j^J}{\partial x_j^J}\cdot \frac{\partial x_j^J}{\partial w_{ji}^J} \\ &= \sum_{k=1}^{10} (o_k^K - t_k) \cdot o_k^K(1-o_k^K) \cdot w_{kj}^K \cdot o_j^J(1-o_j^J) \cdot o_i^I\\ &= \Big( o_j^J(1-o_j^J) \cdot o_i^I \Big) \cdot \Big( \sum_{k=1}^{10}(o_k^K - t_k) \cdot o_k^K(1-o_k^K) \cdot w_{kj}^K \Big) \end{split} \\ j=1,2,\cdots,30\qquad i=1,2,\cdots,784 \end{gather*}$$

  • Gradient of the hidden-layer bias vector

$$\begin{gather*} \begin{split} \frac{\partial E}{\partial b_j^J} &= \sum_{k=1}^{10} \frac{\partial E}{\partial o_k^K}\cdot \frac{\partial o_k^K}{\partial x_k^K}\cdot \frac{\partial x_k^K}{\partial o_j^J}\cdot \frac{\partial o_j^J}{\partial x_j^J}\cdot \frac{\partial x_j^J}{\partial b_j^J} \\ &= \sum_{k=1}^{10} (o_k^K - t_k) \cdot o_k^K(1-o_k^K) \cdot w_{kj}^K \cdot o_j^J(1-o_j^J) \cdot 1\\ &= o_j^J(1-o_j^J) \cdot \Big( \sum_{k=1}^{10}(o_k^K - t_k) \cdot o_k^K(1-o_k^K) \cdot w_{kj}^K \Big) \end{split} \\ j=1,2,\cdots,30 \end{gather*}$$
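
Likewise, the hidden-layer gradients reuse `delta_K`; continuing the sketch:

```python
delta_J = o_J * (1 - o_J) * (W_K.T @ delta_K)  # hidden-layer error term, shape (30, 1)
grad_W_J = delta_J @ x_I.T                     # dE/dW^J, shape (30, 784)
grad_b_J = delta_J                             # dE/db^J, shape (30, 1)
```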

Vectorized Gradient Computation

$$\begin{split}
\text{output-layer error term }\delta^K\rightarrow \delta_k^K &= (o_k^K - t_k) \cdot o_k^K(1-o_k^K),\qquad k=1,2,\cdots,10 \\
\text{output-layer bias gradient}\rightarrow \frac{\partial E}{\partial b_k^K} &= \delta_k^K, \qquad k=1,2,\cdots,10 \\
\text{output-layer weight gradient}\rightarrow \frac{\partial E}{\partial w_{kj}^K} &= \delta_k^K \cdot o_j^J, \qquad k=1,2,\cdots,10\quad j=1,2,\cdots,30 \\
&= \begin{bmatrix}\delta^K_1\\\delta^K_2\\\vdots\\\delta^K_{10}\end{bmatrix} \begin{bmatrix} o^J_1 & o^J_2 & \cdots & o^J_{30}\end{bmatrix} \\\\
\text{hidden-layer error term }\delta^J\rightarrow \delta_j^J &= o_j^J(1-o_j^J) \cdot \Big(\sum_{k=1}^{10}\delta_k^K \cdot w_{kj}^K\Big),\qquad j=1,2,\cdots,30 \\
&= \begin{bmatrix} o_1^J(1-o_1^J)&\cdots&0\\ \vdots&\ddots &\vdots\\ 0 &\cdots&o_{30}^J(1-o_{30}^J) \end{bmatrix} \begin{bmatrix} w_{1,1}^K&\cdots&w_{1,30}^K \\ \vdots&\ddots&\vdots \\ w_{10,1}^K&\cdots&w_{10,30}^K \end{bmatrix}^T \begin{bmatrix} \delta^K_1\\ \vdots\\ \delta^K_{10} \end{bmatrix} \\
\text{hidden-layer bias gradient}\rightarrow \frac{\partial E}{\partial b_j^J} &= \delta_j^J,\qquad j=1,2,\cdots,30 \\
\text{hidden-layer weight gradient}\rightarrow \frac{\partial E}{\partial w_{ji}^J} &= \delta_j^J\cdot o_i^I,\qquad j=1,2,\cdots,30\quad i=1,2,\cdots,784 \\
&= \begin{bmatrix}\delta^J_1\\\delta^J_2\\\vdots\\\delta^J_{30}\end{bmatrix} \begin{bmatrix} o^I_1 & o^I_2 & \cdots & o^I_{784}\end{bmatrix}
\end{split}$$
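
A quick way to validate these vectorized formulas is a central finite-difference check on a single weight; a sketch that reuses the variables from the snippets above:

```python
def loss_at(W):
    # forward pass with the hidden-layer weights replaced by W
    o_J_ = sigmoid(W @ x_I + b_J)
    o_K_ = sigmoid(W_K @ o_J_ + b_K)
    return 0.5 * np.sum((o_K_ - t) ** 2)

eps = 1e-6
j, i = 0, 0                          # check one entry of W^J
W_plus, W_minus = W_J.copy(), W_J.copy()
W_plus[j, i] += eps
W_minus[j, i] -= eps
num_grad = (loss_at(W_plus) - loss_at(W_minus)) / (2 * eps)
print(num_grad, grad_W_J[j, i])      # should agree to several decimal places
```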

Python Code

import numpy as np
import random


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


# def sigmoid_prime(x):
#     return sigmoid(x) * (1 - sigmoid(x))


class MLP_np:
    def __init__(self, sizes):
        """
        :param sizes: [784, 30, 10]
        """
        self.sizes = sizes
        # sizes = [784, 30, 10]
        # w: [size_out, size_in]
        # b: [size_out, 1]
        '''
        weights    : (2,) <class 'list'>
        weights[0] : (30, 784) <class 'numpy.ndarray'>
        weights[1] : (10, 30) <class 'numpy.ndarray'>
        biases     : (2,) <class 'list'>
        biases[0]  : (30, 1) <class 'numpy.ndarray'>
        biases[1]  : (10, 1) <class 'numpy.ndarray'>
        '''
        self.weights = [np.random.randn(size_out, size_in) for size_in, size_out in zip(sizes[:-1], sizes[1:])]  # zip pairs (784, 30), (30, 10) -> shapes (30, 784), (10, 30)
        self.biases = [np.random.randn(size_out, 1) for size_out in sizes[1:]]

    def forward(self, x):
        '''
        Processes one sample at a time.
        :param x: [784, 1]
        :return: [10, 1]
        '''
        o = x  # input layer
        for w, b in zip(self.weights, self.biases):
            x = np.dot(w, o) + b
            o = sigmoid(x)
        return o

    def backprop(self, x, t):
        '''
        :param x: [784, 1]
        :param t: [10, 1], one-hot encoding
        '''
        # holders for the weight and bias gradients
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        # 1. forward
        # save output for every layer
        os = [x]
        # save x for every layer
        xs = []
        # forward
        o = x  # input layer
        for w, b in zip(self.weights, self.biases):
            # forward computation
            x = np.dot(w, o) + b
            o = sigmoid(x)
            # record intermediate values
            xs.append(x)
            os.append(o)

        # compute the loss (np.sum yields a scalar)
        loss = np.sum(np.power(os[-1] - t, 2)) / 2

        '''
        (30,784)@(784,1) => (10,30)@(30,1) => (10,1)
        os[-1]: (10,1)   os[-2]: (30,1)   os[-3]: (784,1)
        nabla_b[-1]: (10,1)   nabla_b[-2]: (30,1)
        nabla_w[-1]: (10,30)  nabla_w[-2]: (30,784)
        delta_K: (10,1)  delta_J: (30,1)
        '''
        # 2. backward
        # 2.1 compute gradient on output layer
        delta_K = os[-1] * (1 - os[-1]) * (os[-1] - t)
        nabla_b[-1] = delta_K
        nabla_w[-1] = np.dot(delta_K, os[-2].T)

        # 2.2 compute gradient on hidden layer
        delta_J = os[-2] * (1 - os[-2]) * np.dot(self.weights[-1].T, delta_K)
        nabla_b[-2] = delta_J
        nabla_w[-2] = np.dot(delta_J, os[-3].T)

        return nabla_w, nabla_b, loss

    def train(self, training_data, epochs, batch_size, lr, test_data):
        '''
        :param training_data: list of (x, t)
        :param epochs: 1000
        :param batch_size: 10
        :param lr: 0.01, learning rate
        :param test_data: list of (x, t)
        '''
        n_test = len(test_data)
        n_train = len(training_data)
        for j in range(epochs):
            random.shuffle(training_data)
            # step over the full training set so the last batch is not dropped
            mini_batches = [
                training_data[k:k + batch_size] for k in range(0, n_train, batch_size)
            ]

            # for every batch in current batches
            for batch in mini_batches:
                loss = self.update_mini_batch(batch, lr)

            if test_data:
                print("Epoch {0}: {1} / {2}".format(j, self.evaluate(test_data), n_test))
                print("Loss: {}".format(loss))
            else:
                print("Epoch {0} complete".format(j))

    def update_mini_batch(self, batch, lr):
        """
        batch: list of (x, t)
        lr: 0.01
        """
        # accumulators for the weight and bias gradients
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        loss = 0

        # for every sample in current batch
        for x, t in batch:
            # accumulate this sample's gradients layer by layer
            nabla_w_, nabla_b_, loss_ = self.backprop(x, t)
            nabla_w[0] += nabla_w_[0]
            nabla_w[1] += nabla_w_[1]
            nabla_b[0] += nabla_b_[0]
            nabla_b[1] += nabla_b_[1]
            loss += loss_
        nabla_w = [w / len(batch) for w in nabla_w]
        nabla_b = [b / len(batch) for b in nabla_b]
        loss = loss / len(batch)

        # w = w - lr * nabla_w
        self.weights = [w - lr * nabla for w, nabla in zip(self.weights, nabla_w)]
        self.biases = [b - lr * nabla for b, nabla in zip(self.biases, nabla_b)]

        return loss

    def evaluate(self, test_data):
        """
        test_data: list of (x, t)
        """
        result = [(np.argmax(self.forward(x)), np.argmax(t)) for x, t in test_data]
        correct = sum([int(pred == t) for pred, t in result])
        return correct


def main():
    from mnist_np import get_dataset
    training_data, test_data = get_dataset()
    print(len(training_data), training_data[0][0].shape, training_data[0][1].shape)

    # Set up a network with 30 hidden neurons
    net = MLP_np([784, 30, 10])

    net.train(training_data, 1000, 10, 0.01, test_data=test_data)


if __name__ == '__main__':
    main()

  • mnist_np.py
import struct
import numpy as np
import matplotlib.pyplot as plt


def load_images(file_name):
    # Before reading or writing a file, open it with Python's built-in open() function:
    # file object = open(file_name [, access_mode][, buffering])
    # file_name is a string holding the name of the file to access
    # access_mode specifies how the file is opened: read, write, append, etc.
    # buffering=0 disables buffering; buffering=1 buffers file access
    # here 'rb' opens the file read-only in binary mode
    binfile = open(file_name, 'rb')
    # read the data from the opened file
    buffers = binfile.read()
    # read the first 4 integers of the image file; '>' means big-endian, 'IIII' means 4 ints
    # offset=0 means reading starts at position 0
    magic, num, rows, cols = struct.unpack_from('>IIII', buffers, 0)
    # the image data in 'train-images.idx3-ubyte' has size 60000*28*28
    # the image data in 't10k-images.idx3-ubyte' has size 10000*28*28
    # each value is an unsigned char, 1 byte in size
    Bytes = num * rows * cols
    # '>' means big-endian, 'B' means unsigned char (1 byte), str(Bytes) converts Bytes to a string
    # '>' + str(Bytes) + 'B' reads Bytes unsigned chars, big-endian
    # struct.calcsize('>IIII') computes the size of '>IIII':
    # struct.calcsize('>IIII') = 16 = 4 * sizeof(int)
    # i.e. offset = 16: skip the first 16 bytes and read from the 17th onward
    # struct.unpack_from() returns a tuple, i.e. type(images) = <class 'tuple'>
    # the tuple read from 'train-images.idx3-ubyte' has 47040000 = 60000*28*28 elements
    # the tuple read from 't10k-images.idx3-ubyte' has 7840000 = 10000*28*28 elements
    images = struct.unpack_from('>' + str(Bytes) + 'B', buffers, struct.calcsize('>IIII'))
    # close the file
    binfile.close()
    # reshape the tuple read from 'train-images.idx3-ubyte' into a [60000, 784] array
    # reshape the tuple read from 't10k-images.idx3-ubyte' into a [10000, 784] array
    images = np.reshape(images, [num, rows * cols])
    return images


def load_labels(file_name):
    # open the file
    binfile = open(file_name, 'rb')
    # read the data from the opened file
    buffers = binfile.read()
    # read the first 2 integers of the label file; the labels have length num
    magic, num = struct.unpack_from('>II', buffers, 0)
    # read the label data
    labels = struct.unpack_from('>' + str(num) + 'B', buffers, struct.calcsize('>II'))
    # close the file
    binfile.close()
    # convert to a one-dimensional array
    labels = np.reshape(labels, [num])
    return labels


def onehot(label):
    '''
    label: (), <np.int32>, one of 0, 1, 2, ..., 9
    '''
    label_onehot = np.zeros([10, 1])
    label_onehot[label][0] = 1

    '''
    label_onehot: (10, 1) <np.ndarray>
    '''
    return label_onehot


def get_dataset():
    # extract image and label data from the official binary files
    # the strings in the four lines below are paths; relative paths are used here
    """
    train_images: (60000, 784) <class 'numpy.ndarray'>
    train_labels: (60000,) <class 'numpy.ndarray'>
    test_images: (10000, 784) <class 'numpy.ndarray'>
    test_labels: (10000,) <class 'numpy.ndarray'>
    """
    train_images = load_images('train-images.idx3-ubyte')
    train_labels = load_labels('train-labels.idx1-ubyte')
    test_images = load_images('t10k-images.idx3-ubyte')
    test_labels = load_labels('t10k-labels.idx1-ubyte')

    # training set
    train_data = []
    for image, label in zip(train_images, train_labels):
        '''
        image: (784,) <class 'numpy.ndarray'>
        label: () <class 'numpy.int32'>
        '''
        image = image[:, np.newaxis]  # turn the (784,) vector into a (784, 1) vector
        label = onehot(label)         # turn the int32 label into a (10, 1) one-hot vector
        '''
        image: (784, 1) <class 'numpy.ndarray'>
        label: (10, 1) <class 'numpy.ndarray'>
        '''
        train_data.append([image, label])

    # test set
    test_data = []
    for image, label in zip(test_images, test_labels):
        '''
        image: (784,) <class 'numpy.ndarray'>
        label: () <class 'numpy.int32'>
        '''
        image = image[:, np.newaxis]  # turn the (784,) vector into a (784, 1) vector
        label = onehot(label)         # turn the int32 label into a (10, 1) one-hot vector
        '''
        image: (784, 1) <class 'numpy.ndarray'>
        label: (10, 1) <class 'numpy.ndarray'>
        '''
        test_data.append([image, label])

    '''
    train_data: (60000, 2) <class 'list'>
    test_data: (10000, 2) <class 'list'>
    '''
    return train_data, test_data


if __name__ == '__main__':
    # the strings in the four lines below are paths; relative paths are used here
    train_images = load_images('train-images.idx3-ubyte')
    train_labels = load_labels('train-labels.idx1-ubyte')
    test_images = load_images('t10k-images.idx3-ubyte')
    test_labels = load_labels('t10k-labels.idx1-ubyte')

    print(train_images.shape, type(train_images))
    print(train_labels.shape, type(train_labels))
    print(test_images.shape, type(test_images))
    print(test_labels.shape, type(test_labels))

    # plot the images that were read
    fig = plt.figure(figsize=(8, 8))
    fig.subplots_adjust(left=0, right=1, bottom=0, top=1, hspace=0.05, wspace=0.05)
    for i in range(30):
        images = np.reshape(train_images[i], [28, 28])
        ax = fig.add_subplot(6, 5, i + 1, xticks=[], yticks=[])
        ax.imshow(images, cmap=plt.cm.binary, interpolation='nearest')
        ax.text(0, 7, str(train_labels[i]))
    plt.show()
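
One practical caveat, offered as a suggestion rather than as part of the original code: get_dataset returns raw pixel values in [0, 255], which can push the sigmoid deep into saturation with randn-initialized weights. A common tweak is to scale the images to [0, 1] when building the dataset:

```python
image = image[:, np.newaxis] / 255.0  # scale pixels to [0, 1] before training
```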