After a full week of print statements, dimensional analysis, refactoring, and reading the code out loud, I can say I'm completely stuck.
The gradients my cost function produces are too far from the ones produced by finite differences.
I have confirmed that my cost function produces the correct cost for both regularized and unregularized inputs. Here is the cost function's code:
def nnCost(nn_params, X, y, lambda_, input_layer_size, hidden_layer_size, num_labels):
    # reshape parameter/weight vectors to suit network size
    Theta1 = np.reshape(nn_params[:hidden_layer_size * (input_layer_size + 1)],
                        (hidden_layer_size, (input_layer_size + 1)))
    Theta2 = np.reshape(nn_params[(hidden_layer_size * (input_layer_size + 1)):],
                        (num_labels, (hidden_layer_size + 1)))

    if lambda_ is None:
        lambda_ = 0

    # grab number of observations
    m = X.shape[0]

    # init variables we must return
    cost = 0
    Theta1_grad = np.zeros(Theta1.shape)
    Theta2_grad = np.zeros(Theta2.shape)

    # one-hot encode the vector y
    y_mtx = pd.get_dummies(y.ravel()).to_numpy()

    ones = np.ones((m, 1))
    X = np.hstack((ones, X))

    # layer 1
    a1 = X
    z2 = Theta1@a1.T
    # layer 2
    ones_l2 = np.ones((y.shape[0], 1))
    a2 = np.hstack((ones_l2, sigmoid(z2.T)))
    z3 = Theta2@a2.T
    # layer 3
    a3 = sigmoid(z3)

    reg_term = (lambda_/(2*m)) * (np.sum(np.sum(np.multiply(Theta1, Theta1)))
                                  + np.sum(np.sum(np.multiply(Theta2, Theta2)))
                                  - np.subtract((Theta1[:,0].T@Theta1[:,0]), (Theta2[:,0].T@Theta2[:,0])))

    cost = (1/m) * np.sum((-np.log(a3).T * (y_mtx) - np.log(1-a3).T * (1-y_mtx))) + reg_term

    # BACKPROPAGATION
    # δ3 equals the difference between a3 and the y_matrix
    d3 = a3 - y_mtx.T
    # δ2 equals the product of δ3 and Θ2 (ignoring the Θ2 bias units)
    # multiplied element-wise by the g′() of z2 (computed back in Step 2).
    d2 = Theta2[:,1:].T@d3 * sigmoidGradient(z2)
    # Δ1 equals the product of δ2 and a1.
    Delta1 = d2@a1
    Delta1 /= m
    # Δ2 equals the product of δ3 and a2.
    Delta2 = d3@a2
    Delta2 /= m

    reg_term1 = (lambda_/m) * np.append(np.zeros((Theta1.shape[0], 1)), Theta1[:,1:], axis=1)
    reg_term2 = (lambda_/m) * np.append(np.zeros((Theta2.shape[0], 1)), Theta2[:,1:], axis=1)

    Theta1_grad = Delta1 + reg_term1
    Theta2_grad = Delta2 + reg_term2

    grad = np.append(Theta1_grad.ravel(), Theta2_grad.ravel())

    return cost, grad
Below is the code that checks the gradients. I have gone over it line by line and there is nothing I can think of to change. It appears to run fine.
def checkNNGradients(lambda_):
    """
    Creates a small neural network to check the backpropagation gradients.
    Credit: Based on the MATLAB code provided by Dr. Andrew Ng, Stanford Univ.

    Input: Regularization parameter, lambda, as int or float.

    Output: Analytical gradients produced by backprop code and the numerical gradients
            (computed using computeNumericalGradient). These two gradient computations
            should result in very similar values.
    """
    input_layer_size = 3
    hidden_layer_size = 5
    num_labels = 3
    m = 5

    # generate 'random' test data
    Theta1 = debugInitializeWeights(hidden_layer_size, input_layer_size)
    Theta2 = debugInitializeWeights(num_labels, hidden_layer_size)

    # reusing debugInitializeWeights to generate X
    X = debugInitializeWeights(m, input_layer_size - 1)
    y = np.ones(m) + np.remainder(np.arange(m), num_labels)

    # unroll parameters
    nn_params = np.append(Theta1.ravel(), Theta2.ravel())

    costFunc = lambda p: nnCost(p, X, y, lambda_, input_layer_size, hidden_layer_size, num_labels)

    cost, grad = costFunc(nn_params)
    numgrad = computeNumericalGradient(costFunc, nn_params)

    # examine the two gradient computations; the two columns should be very similar.
    print('The columns below should be very similar.\n')

    # Credit: http://stackoverflow.com/a/27663954/583834
    print('{:<25}{}'.format('Numerical Gradient', 'Analytical Gradient'))
    for numerical, analytical in zip(numgrad, grad):
        print('{:<25}{}'.format(numerical, analytical))

    # If you have a correct implementation, and assuming you used EPSILON = 0.0001
    # in computeNumericalGradient.m, then diff below should be less than 1e-9
    diff = np.linalg.norm(numgrad - grad) / np.linalg.norm(numgrad + grad)

    print(diff)
    print("\n")
    print('If your backpropagation implementation is correct, then \n'
          'the relative difference will be small (less than 1e-9). \n'
          '\nRelative Difference: {:.10f}'.format(diff))
The checking function uses the debugInitializeWeights function to generate its own data (so this is a reproducible example; just run it and it will call the other functions), and then calls the function that computes the gradient using finite differences. Both are shown below:
def debugInitializeWeights(fan_out, fan_in):
    """
    Initializes the weights of a layer with fan_in incoming connections and fan_out
    outgoing connections using a fixed strategy.

    Input: fan_out, number of outgoing connections for a layer as int; fan_in, number
           of incoming connections for the same layer as int.

    Output: Weight matrix, W, of size (fan_out, 1 + fan_in), as the first column of W
            handles the "bias" terms
    """
    W = np.zeros((fan_out, 1 + fan_in))
    # Initialize W using "sin"; this ensures that the values in W are of similar scale,
    # which is useful for debugging
    W = np.sin(range(1, np.size(W)+1)) / 10
    return W.reshape(fan_out, fan_in+1)


def computeNumericalGradient(J, nn_params):
    """
    Computes the gradient using "finite differences" and provides a numerical estimate
    of the gradient (i.e., gradient of the function J around theta).
    Credit: Based on the MATLAB code provided by Dr. Andrew Ng, Stanford Univ.

    Inputs: Cost function, J (here the nnCost function); Parameter vector, theta.

    Output: Gradient vector using finite differences. Per Dr. Ng,
            'Sets numgrad(i) to (a numerical approximation of) the partial derivative of
            J with respect to the i-th input argument, evaluated at theta. (i.e., numgrad(i)
            should be (approximately) the partial derivative of J with respect
            to theta(i).)'
    """
    numgrad = np.zeros(nn_params.shape)
    perturb = np.zeros(nn_params.shape)
    e = .0001
    for i in range(np.size(nn_params)):
        # Set perturbation (i.e., noise) vector
        perturb[i] = e
        # run cost fxn w/ noise added to and subtracted from parameters theta in nn_params
        cost1, grad1 = J((nn_params - perturb))
        cost2, grad2 = J((nn_params + perturb))
        # record the difference in cost function outputs; this is the numerical gradient
        numgrad[i] = (cost2 - cost1) / (2*e)
        perturb[i] = 0
    return numgrad
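The only pieces not shown are the imports and the sigmoid / sigmoidGradient helpers; they are just the standard logistic function and its element-wise derivative, roughly like this:

import numpy as np
import pandas as pd

def sigmoid(z):
    # standard logistic function (assumed to match the version used above)
    return 1 / (1 + np.exp(-z))

def sigmoidGradient(z):
    # element-wise derivative of the logistic function
    return sigmoid(z) * (1 - sigmoid(z))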
This code is not for a class. That MOOC was in MATLAB and is over now; this is for myself. Other solutions exist online, but studying them has not helped much: everyone has a different (inscrutable) approach. So I am badly in need of either help or a miracle.
Edit/Update: Using Fortran order when unraveling and reshaping the vectors affects the results, but I have not been able to get the gradients to agree by changing that option.
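By that I mean toggling the order argument when flattening and reshaping the parameter vectors, along these lines (a toy example with a made-up 2x3 matrix, just to show what a mismatch does):

import numpy as np

Theta1 = np.arange(6).reshape(2, 3)          # stand-in weight matrix

flat_f = Theta1.ravel(order='F')             # column-major, MATLAB-style
back_ok = flat_f.reshape(2, 3, order='F')    # same order both ways: recovers Theta1
back_bad = flat_f.reshape(2, 3, order='C')   # mismatched order: entries land in the wrong slots

print(np.array_equal(back_ok, Theta1))       # True
print(np.array_equal(back_bad, Theta1))      # False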
Answer:
One thought: I think your perturbation is a bit large at 1e-4. For double-precision floats it should be closer to 1e-8, i.e. the square root of machine precision (or are you working in single precision?).
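For reference, the scales I have in mind can be read straight out of numpy; a quick sketch:

import numpy as np

# machine epsilon and a ballpark finite-difference step (its square root) per precision
for dtype in (np.float64, np.float32):
    eps = np.finfo(dtype).eps
    print(dtype.__name__, eps, np.sqrt(eps))
# float64: eps ≈ 2.2e-16, sqrt(eps) ≈ 1.5e-08
# float32: eps ≈ 1.2e-07, sqrt(eps) ≈ 3.5e-04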
That said, finite differences can be a very poor approximation to the true derivative. Specifically, as you seem to have discovered, floating-point computation in numpy is not exact, and in some situations the noise in the evaluations can cancel many significant digits. What values are you seeing, and what were you expecting to see?
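As a toy illustration of that tradeoff (just sin/cos, not your network): a large step carries truncation error, while a very small step lets rounding noise cancel the significant digits in the difference:

import numpy as np

x = 1.0
exact = np.cos(x)  # true derivative of sin at x

for h in (1e-2, 1e-4, 1e-6, 1e-8, 1e-10, 1e-12):
    approx = (np.sin(x + h) - np.sin(x - h)) / (2 * h)
    print('h = {:.0e}   relative error = {:.1e}'.format(h, abs(approx - exact) / abs(exact)))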