Basic Model
Input Layer and Recurrent Layer
For an input signal $\boldsymbol{u}(n)\in\mathbb{R}^{N_u}$ and a desired target output signal $\boldsymbol{y}^{target}(n)\in\mathbb{R}^{N_y}$, where $n=1,2,\cdots,T$ are discrete time points and $T$ is the number of data points in the training set, our goal is to train a model whose output $\boldsymbol{y}(n)\in\mathbb{R}^{N_y}$ matches the desired target $\boldsymbol{y}^{target}(n)$ as closely as possible, i.e. to minimize an error measure $E(\boldsymbol{y},\boldsymbol{y}^{target})$ between $\boldsymbol{y}$ and $\boldsymbol{y}^{target}$. The error measure is usually a mean squared error (MSE), for example the root mean squared error (RMSE)
$$
E(\boldsymbol{y},\boldsymbol{y}^{target})
= \frac{1}{N_y}\sum_{i=1}^{N_y}\sqrt{\frac{1}{T}\sum_{n=1}^{T}\left(y_i(n)-y_i^{target}(n)\right)^2}
= \frac{1}{N_y}\sum_{i=1}^{N_y}\sqrt{\frac{1}{T}\left\|\boldsymbol{y}_i-\boldsymbol{y}^{target}_i\right\|_2^2}
$$

where $\boldsymbol{y}_i=(y_i(1),\cdots,y_i(T))$ collects the $i$-th output component over all time steps.
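As a quick illustration (not part of the original derivation), the RMSE above can be computed directly with NumPy; the array names `Y` and `Y_target`, both of shape $(N_y, T)$, are assumptions for this sketch:

```python
import numpy as np

def rmse(Y, Y_target):
    """Average, over the output dimensions, of the per-dimension RMSE.

    Y, Y_target: arrays of shape (N_y, T), holding y(n) and y^target(n)
    as columns for n = 1, ..., T.
    """
    per_dim = np.sqrt(np.mean((Y - Y_target) ** 2, axis=1))  # RMSE of each output component
    return per_dim.mean()                                     # average over the N_y components
```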
The structure of an echo state network (ESN) is similar to that of a recurrent neural network (RNN); its recurrent layer (the reservoir) is updated as follows
$$
\begin{aligned}
\tilde{\boldsymbol{x}}(n) &= \tanh\!\left(\boldsymbol{W}^{in}\cdot[1;\boldsymbol{u}(n)]+\boldsymbol{W}\cdot\boldsymbol{x}(n-1)\right), \\
\boldsymbol{x}(n) &= (1-\alpha)\cdot\boldsymbol{x}(n-1)+\alpha\cdot\tilde{\boldsymbol{x}}(n)
\end{aligned}
$$
where $\boldsymbol{x}(n)\in\mathbb{R}^{N_x}$ is the activation vector of the reservoir neurons at time $n$ (corresponding to the recurrent-layer state of an RNN); $\tilde{\boldsymbol{x}}(n)\in\mathbb{R}^{N_x}$ is an intermediate quantity used in the leaky integration of $\boldsymbol{x}(n)$; $\tanh(\cdot)$ is the activation function, applied element-wise; $[\cdot;\cdot]$ denotes vertical concatenation of vectors, and writing $[1;\boldsymbol{u}(n)]$ lets us omit an explicit bias term $\boldsymbol{b}^{in}$, since the first column of $\boldsymbol{W}^{in}\in\mathbb{R}^{N_x\times(1+N_u)}$ then plays the role of the bias; $\boldsymbol{W}^{in}\in\mathbb{R}^{N_x\times(1+N_u)}$ is the input weight matrix and $\boldsymbol{W}\in\mathbb{R}^{N_x\times N_x}$ is the recurrent weight matrix.
The state $\boldsymbol{x}(n)$ is updated with leaky integration, where the parameter $\alpha\in(0,1]$ is called the leaking rate; when $\alpha=1$ we have $\boldsymbol{x}(n)\equiv\tilde{\boldsymbol{x}}(n)$, a special case of the model in which no leaky integration is used.
Output Layer
The output layer is defined as
$$
\boldsymbol{y}(n) = \boldsymbol{W}^{out}\cdot[1;\boldsymbol{u}(n);\boldsymbol{x}(n)],
$$
where $\boldsymbol{y}(n)\in\mathbb{R}^{N_y}$ is the network output, $\boldsymbol{W}^{out}\in\mathbb{R}^{N_y\times(1+N_u+N_x)}$ is the output weight matrix, and $[\cdot;\cdot;\cdot]$ again denotes vertical concatenation of vectors.
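To make the two update equations and the readout concrete, here is a minimal sketch of a single ESN time step in NumPy (the function name `esn_step` and its exact interface are made up for illustration; the full implementation used later in this post differs in detail):

```python
import numpy as np

def esn_step(u, x_prev, Win, W, Wout, alpha):
    """One ESN time step: leaky-integrated reservoir update, then the linear readout.

    u      : input u(n), shape (N_u,)
    x_prev : previous reservoir state x(n-1), shape (N_x,)
    Win    : input weights, shape (N_x, 1 + N_u)  (first column acts as the bias)
    W      : recurrent weights, shape (N_x, N_x)
    Wout   : output weights, shape (N_y, 1 + N_u + N_x)
    alpha  : leaking rate in (0, 1]
    """
    u1 = np.concatenate(([1.0], u))                 # [1; u(n)]
    x_tilde = np.tanh(Win @ u1 + W @ x_prev)        # candidate state x~(n)
    x = (1 - alpha) * x_prev + alpha * x_tilde      # leaky integration
    y = Wout @ np.concatenate(([1.0], u, x))        # y(n) = Wout [1; u(n); x(n)]
    return x, y
```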
Overall Network Structure
Unrolling the recurrence (taking $\alpha=1$, i.e. no leaky integration, for simplicity), the whole network can be written as

$$
\begin{aligned}
\boldsymbol{y}(n) &= \boldsymbol{W}^{out}\cdot[1;\boldsymbol{u}(n);\boldsymbol{x}(n)] \\[1ex]
\boldsymbol{x}(n) &= \tanh\!\left( \boldsymbol{W}^{in} [1;\boldsymbol{u}(n)]+\boldsymbol{W}\boldsymbol{x}(n-1)\right) \\
&= \tanh\!\left( \boldsymbol{W}^{in} [1;\boldsymbol{u}(n)]+\boldsymbol{W}
\tanh\!\left( \boldsymbol{W}^{in} [1;\boldsymbol{u}(n-1)]+\boldsymbol{W}\boldsymbol{x}(n-2)\right)\right) \\
&\;\;\vdots \\
&= \tanh\!\left( \boldsymbol{W}^{in} [1;\boldsymbol{u}(n)]+\boldsymbol{W}
\tanh\!\left( \boldsymbol{W}^{in} [1;\boldsymbol{u}(n-1)]+\cdots+\boldsymbol{W}\tanh\!\left( \boldsymbol{W}^{in} [1;\boldsymbol{u}(1)]+\boldsymbol{W}\boldsymbol{x}(0)\right)\right)\right)
\end{aligned}
$$
We can see that the network output $\boldsymbol{y}(n)$ at time $n$ depends on the inputs at all previous time steps $\boldsymbol{u}(n),\boldsymbol{u}(n-1),\cdots,\boldsymbol{u}(1)$ and on the initial state $\boldsymbol{x}(0)$.
Generating the Reservoir
To generate a "good" reservoir, we first need to understand what the reservoir actually does.
The Role of the Reservoir
Nonlinearly map the input vector into a high-dimensional space
Store the state of the network (which implicitly encodes all of the preceding inputs)
The Influence of Each Time Step's Input on the Result
Denote by $E(n)$ the error between the output $\boldsymbol{y}(n)$ at time $n$ and the target $\boldsymbol{y}^{target}(n)$:
$$
E(n) = \|\boldsymbol{y}(n)-\boldsymbol{y}^{target}(n)\|_2^2
$$
By the chain rule, the sensitivity of $E(n)$ to the input $k$ steps earlier is

$$
\frac{\partial E(n)}{\partial \boldsymbol{u}(n-k)}=
\frac{\partial E(n)}{\partial \boldsymbol{x}(n)}
\frac{\partial \boldsymbol{x}(n)}{\partial \boldsymbol{x}(n-1)}
\frac{\partial \boldsymbol{x}(n-1)}{\partial \boldsymbol{x}(n-2)}
\cdots
\frac{\partial \boldsymbol{x}(n-k+1)}{\partial \boldsymbol{x}(n-k)}
\frac{\partial \boldsymbol{x}(n-k)}{\partial \boldsymbol{u}(n-k)}
\qquad
k=0,1,2,\cdots,n-1
$$
The first factor is

$$
\frac{\partial E(n)}{\partial \boldsymbol{x}(n)}
= \frac{\partial E(n)}{\partial \boldsymbol{y}(n)}
\frac{\partial \boldsymbol{y}(n)}{\partial \boldsymbol{x}(n)}
= 2\begin{bmatrix}
y_1(n)-y^{target}_1(n), & \cdots & , y_{N_y}(n)-y^{target}_{N_y}(n)
\end{bmatrix}
\boldsymbol{W}^{out}_x
$$

where $\boldsymbol{W}^{out}_x\in\mathbb{R}^{N_y\times N_x}$ denotes the block of columns of $\boldsymbol{W}^{out}$ that multiplies $\boldsymbol{x}(n)$.
Writing $\boldsymbol{\eta}(n) = \boldsymbol{W}^{in}\cdot[1;\boldsymbol{u}(n)]+\boldsymbol{W}\cdot\boldsymbol{x}(n-1)$ for the pre-activation of the reservoir, so that $\boldsymbol{x}(n)=\tanh(\boldsymbol{\eta}(n))$ when $\alpha=1$, the intermediate factors are

$$
\begin{aligned}
\frac{\partial \boldsymbol{x}(n-i+1)}{\partial \boldsymbol{x}(n-i)}
&=
\frac{\partial \boldsymbol{x}(n-i+1)}{\partial \boldsymbol{\eta}(n-i+1)}
\frac{\partial \boldsymbol{\eta}(n-i+1)}{\partial \boldsymbol{x}(n-i)} \\
&=
\begin{bmatrix}
\frac{\partial x_{1}(n-i+1)}{\partial \eta_{1}(n-i+1)}&0&\cdots&0\\
0&\frac{\partial x_{2}(n-i+1)}{\partial \eta_{2}(n-i+1)}&\cdots&0\\
\vdots&\vdots&\ddots&\vdots\\
0&0&\cdots&\frac{\partial x_{N_x}(n-i+1)}{\partial \eta_{N_x}(n-i+1)}
\end{bmatrix}
\boldsymbol{W},
\qquad i=1,2,\cdots,k
\end{aligned}
$$
and the last factor is

$$
\begin{aligned}
\frac{\partial \boldsymbol{x}(n-k)}{\partial \boldsymbol{u}(n-k)}
&=
\frac{\partial \boldsymbol{x}(n-k)}{\partial \boldsymbol{\eta}(n-k)}
\frac{\partial \boldsymbol{\eta}(n-k)}{\partial \boldsymbol{u}(n-k)} \\
&=
\begin{bmatrix}
\frac{\partial x_{1}(n-k)}{\partial \eta_{1}(n-k)}&0&\cdots&0\\
0&\frac{\partial x_{2}(n-k)}{\partial \eta_{2}(n-k)}&\cdots&0\\
\vdots&\vdots&\ddots&\vdots\\
0&0&\cdots&\frac{\partial x_{N_x}(n-k)}{\partial \eta_{N_x}(n-k)}
\end{bmatrix}
\boldsymbol{W}^{in}_u
\end{aligned}
$$

where $\boldsymbol{W}^{in}_u\in\mathbb{R}^{N_x\times N_u}$ denotes the columns of $\boldsymbol{W}^{in}$ that multiply $\boldsymbol{u}(n-k)$, i.e. $\boldsymbol{W}^{in}$ without its bias column.
All derivatives here are written in numerator layout.
Therefore
$$
\frac{\partial E(n)}{\partial \boldsymbol{u}(n-k)}=
\frac{\partial E(n)}{\partial \boldsymbol{x}(n)}
\prod_{i=1}^{k}
\left(
\begin{bmatrix}
\frac{\partial x_{1}(n-i+1)}{\partial \eta_{1}(n-i+1)}&\cdots&0\\
\vdots&\ddots&\vdots\\
0&\cdots&\frac{\partial x_{N_x}(n-i+1)}{\partial \eta_{N_x}(n-i+1)}
\end{bmatrix}
\boldsymbol{W}
\right)
\frac{\partial \boldsymbol{x}(n-k)}{\partial \boldsymbol{u}(n-k)}
$$
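As a rough numerical illustration of the product above (everything here — the random weights, sizes, and the test input — is made up for the sketch; it is not part of the training procedure): using $\tanh'(\eta_j)=1-x_j^2$, we can track the norm of $\prod_i \operatorname{diag}(1-\boldsymbol{x}^2)\,\boldsymbol{W}$ while running the reservoir, which gives a feel for how strongly states (and hence inputs) from many steps back still influence the current state; the scaling of $\boldsymbol{W}$ largely controls how quickly this norm shrinks.

```python
import numpy as np

rng = np.random.default_rng(0)
Nx, Nu, n_steps = 100, 1, 30

# hypothetical small ESN with alpha = 1 (no leaky integration)
Win = rng.uniform(-0.5, 0.5, (Nx, 1 + Nu))
W = rng.uniform(-0.5, 0.5, (Nx, Nx))
W *= 0.9 / max(abs(np.linalg.eigvals(W)))      # scale the spectral radius to 0.9

x = np.zeros(Nx)
jac_prod = np.eye(Nx)                           # running product of dx(m)/dx(m-1)
for n in range(1, n_steps + 1):
    u = np.array([np.sin(0.2 * n)])             # arbitrary test input
    x = np.tanh(Win @ np.concatenate(([1.0], u)) + W @ x)
    jac_prod = (np.diag(1 - x**2) @ W) @ jac_prod   # tanh'(eta) = 1 - x(n)^2
    print(n, np.linalg.norm(jac_prod, 2))       # how much x(0) still affects x(n)
```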
Training the Output Layer
Ridge Regression
Since the output layer of the ESN is a standard linear feed-forward structure, it can be written in matrix form as
$$
\boldsymbol{Y} = \boldsymbol{W}^{out}\boldsymbol{X},
$$
where $\boldsymbol{Y}\in\mathbb{R}^{N_y\times T}$ collects all $\boldsymbol{y}(n),\ n=1,\cdots,T$, and $\boldsymbol{X}\in\mathbb{R}^{(1+N_u+N_x)\times T}$ collects all $[1;\boldsymbol{u}(n);\boldsymbol{x}(n)],\ n=1,\cdots,T$. For brevity we write a single $\boldsymbol{X}$ for $[1;\boldsymbol{U};\boldsymbol{X}]$.
Our goal is to find the output weight matrix $\boldsymbol{W}^{out}\in\mathbb{R}^{N_y\times(1+N_u+N_x)}$ that minimizes the error between $\boldsymbol{Y}$ and $\boldsymbol{Y}^{target}$. This is equivalent to solving the overdetermined system of linear equations
$$
\boldsymbol{Y}^{target} = \boldsymbol{W}^{out}\boldsymbol{X}
$$
in the least-squares sense, where $\boldsymbol{Y}^{target}\in\mathbb{R}^{N_y\times T}$ and $\boldsymbol{X}\in\mathbb{R}^{(1+N_u+N_x)\times T}$ are known and $\boldsymbol{W}^{out}\in\mathbb{R}^{N_y\times(1+N_u+N_x)}$ is sought. The system is overdetermined because in general $T\gg 1+N_u+N_x$, i.e. the number of equations far exceeds the number of unknown parameters.
Instead of taking the plain least-squares solution, we add a regularization term to the objective function:
$$
\begin{aligned}
\boldsymbol{W}^{out} &= \arg\min_{\boldsymbol{W}^{out}}
\sum_{i=1}^{N_y}\left(\sum_{n=1}^{T}\left(y_i(n)-y_i^{target}(n)\right)^2+\beta\sum_{j=1}^{1+N_u+N_x}\left(w^{out}_{ij}\right)^2\right)\\
&=
\arg\min_{\boldsymbol{W}^{out}}\left(
\|\boldsymbol{Y}-\boldsymbol{Y}^{target}\|_F^2 + \beta\|\boldsymbol{W}^{out}\|_F^2\right)
\\
&=
\arg\min_{\boldsymbol{W}^{out}}\left(
\|\boldsymbol{W}^{out}\boldsymbol{X}-\boldsymbol{Y}^{target}\|_F^2 + \beta\|\boldsymbol{W}^{out}\|_F^2\right)
\end{aligned}
$$
where $w^{out}_{ij}$ denotes the element in row $i$, column $j$ of $\boldsymbol{W}^{out}$.
$\|\cdot\|_F$ denotes the Frobenius norm: for a matrix $A=[a_{ij}]_{m\times n}$ it is defined as $\|A\|_F = \sqrt{\operatorname{tr}(A^TA)}=\sqrt{\sum_{i=1}^m\sum_{j=1}^n a_{ij}^2}$.
Taking the partial derivative of the objective with respect to $\boldsymbol{W}^{out}$ and using the standard matrix derivative identities, we get
$$
\begin{aligned}
\frac{\partial \|\boldsymbol{W}^{out}\boldsymbol{X}-\boldsymbol{Y}^{target}\|_F^2}{\partial \boldsymbol{W}^{out}}
&=
\frac{\partial\, {\rm tr}\!\left((\boldsymbol{W}^{out}\boldsymbol{X}-\boldsymbol{Y}^{target})(\boldsymbol{W}^{out}\boldsymbol{X}-\boldsymbol{Y}^{target})^T\right)}{\partial \boldsymbol{W}^{out}}
= 2\,(\boldsymbol{W}^{out}\boldsymbol{X}-\boldsymbol{Y}^{target})\boldsymbol{X}^T
\\[1ex]
\frac{\partial \|\boldsymbol{W}^{out}\|_F^2}{\partial \boldsymbol{W}^{out}}
&=
\frac{\partial\, {\rm tr}\!\left(\boldsymbol{W}^{out}(\boldsymbol{W}^{out})^T\right)}{\partial \boldsymbol{W}^{out}}
= 2\,\boldsymbol{W}^{out}
\end{aligned}
$$
Setting the derivative of the objective to zero (and dividing both sides by 2),
$$
(\boldsymbol{W}^{out}\boldsymbol{X}-\boldsymbol{Y}^{target})\boldsymbol{X}^T+\beta\boldsymbol{W}^{out} = \boldsymbol{0}
$$
we obtain
$$
\boldsymbol{W}^{out} = \boldsymbol{Y}^{target}\boldsymbol{X}^T\left(\boldsymbol{X}\boldsymbol{X}^T+\beta\boldsymbol{I}\right)^{-1}
$$
When $\beta=0$, this $\boldsymbol{W}^{out}$ reduces to the ordinary least-squares solution.
Application to Time Series
The Mackey-Glass Time Series
The Mackey-Glass equation was introduced by Michael Mackey and Leon Glass in their 1977 Science paper "Oscillation and chaos in physiological control systems". It is a nonlinear delay differential equation whose general form is
$$
\frac{dx(t)}{dt} = \beta\frac{x(t-\tau)}{1+x^{n}(t-\tau)}-\gamma x(t), \qquad \beta,\gamma,n>0,
$$
where $\beta,\gamma,n$ are positive real parameters and $\tau$ is the time delay.
Taking the parameters $\beta=0.2,\ \gamma=0.1,\ n=10$ gives the commonly used form
$$
\frac{dx(t)}{dt} = 0.2\frac{x(t-\tau)}{1+x^{10}(t-\tau)}-0.1 x(t)
$$
In this form, when the delay $\tau\ge 17$ the resulting series is aperiodic and chaotic; it neither converges nor diverges over time.
Here we solve the Mackey-Glass equation numerically with the Euler method to generate the time series. We take $\tau=17$, and the initial history is generated from the cosine function as
$$
x(0)=\cos(0),\quad x(1)=\cos(1),\quad \cdots,\quad x(17)=\cos(17)
$$
The recurrence is
$$
\begin{cases}
x(n+1) = x(n) + h\left(0.2\dfrac{x(n-17)}{1+x^{10}(n-17)}-0.1\, x(n)\right), & n\ge 17\\
x(n) = \cos(n), & 0\le n\le 17
\end{cases}
$$
where $h$ is the time step; here we take $h=1$.
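Translated directly into code, the recurrence can be used to generate such a series (a sketch following the formulas above; the `MackeyGlass_t17.txt` file loaded later may have been produced with a different, finer integration scheme, so a series generated this way need not match it exactly):

```python
import numpy as np

tau, h, N = 17, 1.0, 10000                     # delay, Euler step size, series length

x = np.zeros(N)
x[:tau + 1] = np.cos(np.arange(tau + 1))       # x(n) = cos(n) for 0 <= n <= 17

for n in range(tau, N - 1):                    # Euler step of the delay equation
    x[n + 1] = x[n] + h * (0.2 * x[n - tau] / (1 + x[n - tau] ** 10) - 0.1 * x[n])

np.savetxt('mackey_glass_euler.txt', x)        # hypothetical output file name
```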
This gives us the data set; the figure below shows some of the sample points.
The next figure shows the phase plot of the generated data set.
We can see that the data show no periodicity, neither converge nor diverge, and behave chaotically within a bounded region.
Predicting the Mackey-Glass Time Series with an ESN
Python Implementation
```python
import numpy as np
from numpy import linalg

# lengths of the training and test sets, and of the initial transient to discard
trainLen = 2000
testLen = 2000
initLen = 100

data = np.loadtxt('MackeyGlass_t17.txt')
```
```python
inSize = outSize = 1   # input and output dimensions
resSize = 1000         # reservoir size N_x
a = 0.3                # leaking rate alpha
```
Generate random weight matrices uniformly distributed on $[-0.5, 0.5)$
```python
Win = np.random.rand(resSize, 1 + inSize) - 0.5
W = np.random.rand(resSize, resSize) - 0.5
```
Compute the spectral radius of the recurrent matrix and rescale it
```python
rhoW = max(abs(linalg.eig(W)[0]))   # spectral radius of W
W *= 1.25 / rhoW                    # rescale so the spectral radius becomes 1.25
```
Use the data at the next time step as the target for the input at the current time step:
$$
\boldsymbol{y}^{target}(n) = \boldsymbol{u}(n+1)
$$
```python
# target outputs, skipping the initial transient
Yt = data[None, initLen+1:trainLen+1]
```
Run the reservoir over the training inputs according to
$$
\begin{aligned}
\tilde{\boldsymbol{x}}(n) &= \tanh\!\left(\boldsymbol{W}^{in}\cdot[1;\boldsymbol{u}(n)]+\boldsymbol{W}\cdot\boldsymbol{x}(n-1)\right), \\
\boldsymbol{x}(n) &= (1-\alpha)\cdot\boldsymbol{x}(n-1)+\alpha\cdot\tilde{\boldsymbol{x}}(n)
\end{aligned}
$$
```python
# collect the extended states [1; u(n); x(n)] column by column
X = np.zeros((1 + inSize + resSize, trainLen - initLen))
x = np.zeros((resSize, 1))
for t in range(trainLen):
    u = data[t]
    x = (1 - a) * x + a * np.tanh(np.dot(Win, np.vstack((1, u))) + np.dot(W, x))
    if t >= initLen:
        X[:, t - initLen] = np.vstack((1, u, x))[:, 0]
```
Compute the ridge-regression (regularized least-squares) solution for the output matrix according to
$$
\boldsymbol{W}^{out} = \boldsymbol{Y}^{target}\boldsymbol{X}^T\left(\boldsymbol{X}\boldsymbol{X}^T+\beta\boldsymbol{I}\right)^{-1}
$$
```python
reg = 1e-8   # regularization coefficient beta
X_T = X.T
Wout = np.dot(np.dot(Yt, X_T),
              linalg.inv(np.dot(X, X_T) + reg * np.eye(1 + inSize + resSize)))
```
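A side note on the line above (an optional alternative, not what the code here does): explicitly inverting $\boldsymbol{X}\boldsymbol{X}^T+\beta\boldsymbol{I}$ works, but solving the corresponding linear system avoids forming the inverse and is generally a bit more numerically stable:

```python
# equivalent ridge solution via a linear solve:
# (X X^T + beta I) Wout^T = X Yt^T, since X X^T + beta I is symmetric
Wout_alt = linalg.solve(np.dot(X, X_T) + reg * np.eye(1 + inSize + resSize),
                        np.dot(X, Yt.T)).T
```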
Continuing directly after the training set, we feed the network the data point that follows the last training point, and from then on use the network's output at the current time step as its input at the next time step (generative mode).
```python
Y = np.zeros((outSize, testLen))
u = data[trainLen]   # first input: the point right after the training set
for t in range(testLen):
    x = (1 - a) * x + a * np.tanh(np.dot(Win, np.vstack((1, u))) + np.dot(W, x))
    y = np.dot(Wout, np.vstack((1, u, x)))
    Y[:, t] = y
    u = y            # feed the prediction back as the next input
```
Compute the mean squared error over the first 300 predicted points
```python
errorLen = 300
mse = sum(np.square(data[trainLen+1:trainLen+errorLen+1] - Y[0, 0:errorLen])) / errorLen
print('MSE = ' + str(mse))
```
Compare the predicted sequence with the target sequence
The figure above shows that the model predicts the first 300 or so points of the time series well, while as time goes on the predictions gradually drift away from the target values.