Using the notation of Backpropagation calculus | Deep learning, chapter 4, I have this back-propagation code for a 4-layer (i.e. 2 hidden layers) neural network:
def sigmoid_prime(z):
    return z * (1-z)  # because σ'(x) = σ(x) (1 - σ(x))

def train(self, input_vector, target_vector):
    a = np.array(input_vector, ndmin=2).T
    y = np.array(target_vector, ndmin=2).T

    # forward
    A = [a]
    for k in range(3):
        a = sigmoid(np.dot(self.weights[k], a))  # zero bias here just for simplicity
        A.append(a)
    # Now A has 4 elements: the input vector + the 3 output vectors

    # back-propagation
    delta = a - y
    for k in [2, 1, 0]:
        tmp = delta * sigmoid_prime(A[k+1])
        delta = np.dot(self.weights[k].T, tmp)  # (1) <---- HERE
        self.weights[k] -= self.learning_rate * np.dot(tmp, A[k].T)
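For reference, here is a minimal class skeleton that makes the snippet above runnable end to end. The class name, layer sizes, weight initialisation and the calls at the bottom are my own assumptions for illustration, not part of my actual setup:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class Net:  # hypothetical wrapper, just to give train() a home
    def __init__(self, sizes=(784, 128, 64, 10), learning_rate=0.1):
        # one weight matrix per layer transition; biases are omitted, as in train()
        self.weights = [0.1 * np.random.randn(sizes[i + 1], sizes[i]) for i in range(3)]
        self.learning_rate = learning_rate

# Attach the train() method defined above and run one toy training step:
# Net.train = train
# net = Net()
# net.train(np.random.rand(784), np.eye(10)[3])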
It works, but:
- The final accuracy (for my use case: MNIST digit recognition) is just OK, but not great. It works much better, i.e. convergence is much better, when line of code (1) is replaced with:
delta = np.dot(self.weights[k].T, delta) # (2)
- The code from Machine Learning with Python: Training and Testing the Neural Network with MNIST data set also suggests using:
delta = np.dot(self.weights[k].T, delta)
instead of:
delta = np.dot(self.weights[k].T, tmp)
(in the notation of that article, this is:
output_errors = np.dot(self.weights_matrices[layer_index-1].T, output_errors)
)
These two arguments seem to agree: code (2) is better than code (1).
However, the math seems to show the contrary (see this video; another detail: note that my loss function is multiplied by 1/2, whereas it isn't in the video):
Question: which one is correct, implementation (1) or (2)?
In LaTeX:
$$C = \frac{1}{2} (a^L - y)^2$$

$$a^L = \sigma(\underbrace{w^L a^{L-1} + b^L}_{z^L}) = \sigma(z^L)$$

$$\frac{\partial{C}}{\partial{w^L}} = \frac{\partial{z^L}}{\partial{w^L}} \frac{\partial{a^L}}{\partial{z^L}} \frac{\partial{C}}{\partial{a^L}} = a^{L-1} \sigma'(z^L)(a^L-y)$$

$$\frac{\partial{C}}{\partial{a^{L-1}}} = \frac{\partial{z^L}}{\partial{a^{L-1}}} \frac{\partial{a^L}}{\partial{z^L}} \frac{\partial{C}}{\partial{a^L}} = w^L \sigma'(z^L)(a^L-y)$$

$$\frac{\partial{C}}{\partial{w^{L-1}}} = \frac{\partial{z^{L-1}}}{\partial{w^{L-1}}} \frac{\partial{a^{L-1}}}{\partial{z^{L-1}}} \frac{\partial{C}}{\partial{a^{L-1}}} = a^{L-2} \sigma'(z^{L-1}) \times w^L \sigma'(z^L)(a^L-y)$$
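For what it's worth, the last-layer formula can be checked numerically with finite differences. Below is a minimal sketch under my own assumptions (a single output layer, toy shapes, random data, and the 1/2-scaled loss above); it is not code from my network:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
a_prev = rng.random((5, 1))        # a^{L-1}
w = rng.standard_normal((3, 5))    # w^L (bias omitted, as in the code above)
y = rng.random((3, 1))             # target

def cost(w):
    return 0.5 * np.sum((sigmoid(np.dot(w, a_prev)) - y) ** 2)

# analytic gradient: dC/dw^L = (a^L - y) * sigma'(z^L), outer product with a^{L-1}
a_L = sigmoid(np.dot(w, a_prev))
analytic = ((a_L - y) * a_L * (1 - a_L)) @ a_prev.T

# finite-difference estimate of dC/dw^L
eps = 1e-6
numeric = np.zeros_like(w)
for i in range(w.shape[0]):
    for j in range(w.shape[1]):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[i, j] += eps
        w_minus[i, j] -= eps
        numeric[i, j] = (cost(w_plus) - cost(w_minus)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))   # ~1e-10: the formula checks out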
Answer:
I spent two days analyzing this problem and filled a few pages of my notebook with partial derivative computations… and I can confirm:
- the math written in LaTeX in the question is correct
- code (1) is the correct one, and it agrees with the math computations:
delta = a - y
for k in [2, 1, 0]:
    tmp = delta * sigmoid_prime(A[k+1])
    delta = np.dot(self.weights[k].T, tmp)
    self.weights[k] -= self.learning_rate * np.dot(tmp, A[k].T)
- code (2) is wrong:
delta = a - y
for k in [2, 1, 0]:
    tmp = delta * sigmoid_prime(A[k+1])
    delta = np.dot(self.weights[k].T, delta)  # WRONG HERE
    self.weights[k] -= self.learning_rate * np.dot(tmp, A[k].T)
and there is a slight mistake in Machine Learning with Python: Training and Testing the Neural Network with MNIST data set:
output_errors = np.dot(self.weights_matrices[layer_index-1].T, output_errors)
should be
output_errors = np.dot(self.weights_matrices[layer_index-1].T, output_errors * out_vector * (1.0 - out_vector))
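To see concretely why the σ' factor has to be applied before multiplying by the transposed weights, here is a small sketch (toy shapes and random data of my own choosing, not the article's code) that compares both propagated deltas against a finite-difference estimate of ∂C/∂a^{L-1}:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)
a_prev = rng.random((5, 1))         # a^{L-1}
w = rng.standard_normal((3, 5))     # w^L
y = rng.random((3, 1))              # target

def cost(a_in):
    return 0.5 * np.sum((sigmoid(np.dot(w, a_in)) - y) ** 2)

a_L = sigmoid(np.dot(w, a_prev))
delta = a_L - y                      # dC/da^L
tmp = delta * a_L * (1 - a_L)        # dC/dz^L
delta_1 = np.dot(w.T, tmp)           # as in code (1)
delta_2 = np.dot(w.T, delta)         # as in code (2), sigma' missing

# finite-difference estimate of dC/da^{L-1}
eps = 1e-6
numeric = np.zeros_like(a_prev)
for i in range(a_prev.shape[0]):
    p, m = a_prev.copy(), a_prev.copy()
    p[i] += eps
    m[i] -= eps
    numeric[i] = (cost(p) - cost(m)) / (2 * eps)

print(np.max(np.abs(delta_1 - numeric)))   # ~1e-10: code (1) matches the true gradient
print(np.max(np.abs(delta_2 - numeric)))   # order 1: code (2) does not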
Now, the difficult part that took me days to realize:
- apparently code (2) converges far better than code (1), and that is why I mistakenly thought code (2) was correct and code (1) was wrong
- …but in fact this is just a coincidence, because the learning_rate was set too low. Here is the reason: with code (2), the parameter delta grows much faster than with code (1) (print np.linalg.norm(delta) helps to see this; see also the sketch after this list).
- thus the "incorrect code (2)" simply compensated for the "too-low learning rate" by producing a bigger delta parameter, and in some cases this led to apparently faster convergence.
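For the record, this is roughly the instrumentation that revealed it. It is only a sketch: it assumes the sigmoid / sigmoid_prime definitions and the class layout from the question, and the variant flag is something I added purely for the comparison:

import numpy as np

def train_with_logging(self, input_vector, target_vector, variant=1):
    # Same pass as train() above, but printing ||delta|| at every layer so the
    # growth of the error signal under variant (1) vs variant (2) can be compared.
    a = np.array(input_vector, ndmin=2).T
    y = np.array(target_vector, ndmin=2).T
    A = [a]
    for k in range(3):
        a = sigmoid(np.dot(self.weights[k], a))
        A.append(a)
    delta = a - y
    for k in [2, 1, 0]:
        tmp = delta * sigmoid_prime(A[k + 1])
        if variant == 1:
            delta = np.dot(self.weights[k].T, tmp)    # correct back-propagation
        else:
            delta = np.dot(self.weights[k].T, delta)  # variant (2): delta grows much faster
        print("layer", k, "||delta|| =", np.linalg.norm(delta))
        self.weights[k] -= self.learning_rate * np.dot(tmp, A[k].T)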
Now the problem is solved!