Checking neural network gradients with finite differences doesn't work

After a full week of print statements, dimensional analysis, refactoring, and reading the code out loud, I can say I'm completely stuck.

The gradients produced by my cost function are too far off from the gradients produced by the finite-difference method.

I have already confirmed that my cost function produces the correct cost for both regularized and unregularized inputs. Here is the cost function code:

def nnCost(nn_params, X, y, lambda_, input_layer_size, hidden_layer_size, num_labels):
    # reshape parameter/weight vectors to suit network size
    Theta1 = np.reshape(nn_params[:hidden_layer_size * (input_layer_size + 1)],
                        (hidden_layer_size, (input_layer_size + 1)))
    Theta2 = np.reshape(nn_params[(hidden_layer_size * (input_layer_size + 1)):],
                        (num_labels, (hidden_layer_size + 1)))

    if lambda_ is None:
        lambda_ = 0

    # grab number of observations
    m = X.shape[0]

    # init variables we must return
    cost = 0
    Theta1_grad = np.zeros(Theta1.shape)
    Theta2_grad = np.zeros(Theta2.shape)

    # one-hot encode the vector y
    y_mtx = pd.get_dummies(y.ravel()).to_numpy()

    ones = np.ones((m, 1))
    X = np.hstack((ones, X))

    # layer 1
    a1 = X
    z2 = Theta1 @ a1.T

    # layer 2
    ones_l2 = np.ones((y.shape[0], 1))
    a2 = np.hstack((ones_l2, sigmoid(z2.T)))
    z3 = Theta2 @ a2.T

    # layer 3
    a3 = sigmoid(z3)

    reg_term = (lambda_ / (2 * m)) * (np.sum(np.sum(np.multiply(Theta1, Theta1)))
                                      + np.sum(np.sum(np.multiply(Theta2, Theta2)))
                                      - np.subtract((Theta1[:, 0].T @ Theta1[:, 0]),
                                                    (Theta2[:, 0].T @ Theta2[:, 0])))
    cost = (1 / m) * np.sum((-np.log(a3).T * (y_mtx) - np.log(1 - a3).T * (1 - y_mtx))) + reg_term

    # BACKPROPAGATION
    # δ3 equals the difference between a3 and the y_matrix
    d3 = a3 - y_mtx.T
    # δ2 equals the product of δ3 and Θ2 (ignoring the Θ2 bias units),
    # multiplied element-wise by g'() of z2 (computed back in Step 2)
    d2 = Theta2[:, 1:].T @ d3 * sigmoidGradient(z2)
    # Δ1 equals the product of δ2 and a1
    Delta1 = d2 @ a1
    Delta1 /= m
    # Δ2 equals the product of δ3 and a2
    Delta2 = d3 @ a2
    Delta2 /= m

    reg_term1 = (lambda_ / m) * np.append(np.zeros((Theta1.shape[0], 1)), Theta1[:, 1:], axis=1)
    reg_term2 = (lambda_ / m) * np.append(np.zeros((Theta2.shape[0], 1)), Theta2[:, 1:], axis=1)

    Theta1_grad = Delta1 + reg_term1
    Theta2_grad = Delta2 + reg_term2

    grad = np.append(Theta1_grad.ravel(), Theta2_grad.ravel())

    return cost, grad

Below is the code that checks the gradients. I have gone through it line by line and can't find anything I'd change. It appears to run fine.

def checkNNGradients(lambda_):
    """
    Creates a small neural network to check the backpropagation gradients.
    Credit: Based on the MATLAB code provided by Dr. Andrew Ng, Stanford Univ.

    Input: Regularization parameter, lambda_, as int or float.

    Output: Analytical gradients produced by the backprop code and the numerical gradients
    (computed using computeNumericalGradient). These two gradient computations should
    result in very similar values.
    """
    input_layer_size = 3
    hidden_layer_size = 5
    num_labels = 3
    m = 5

    # generate 'random' test data
    Theta1 = debugInitializeWeights(hidden_layer_size, input_layer_size)
    Theta2 = debugInitializeWeights(num_labels, hidden_layer_size)

    # reusing debugInitializeWeights to generate X
    X = debugInitializeWeights(m, input_layer_size - 1)
    y = np.ones(m) + np.remainder(np.arange(m), num_labels)

    # unroll parameters
    nn_params = np.append(Theta1.ravel(), Theta2.ravel())

    costFunc = lambda p: nnCost(p, X, y, lambda_, input_layer_size, hidden_layer_size, num_labels)

    cost, grad = costFunc(nn_params)
    numgrad = computeNumericalGradient(costFunc, nn_params)

    # examine the two gradient computations; the two columns should be very similar
    print('The columns below should be very similar.\n')

    # Credit: http://stackoverflow.com/a/27663954/583834
    print('{:<25}{}'.format('Numerical Gradient', 'Analytical Gradient'))
    for numerical, analytical in zip(numgrad, grad):
        print('{:<25}{}'.format(numerical, analytical))

    # If the implementation is correct, and assuming EPSILON = 0.0001 was used in
    # computeNumericalGradient, then diff below should be less than 1e-9
    diff = np.linalg.norm(numgrad - grad) / np.linalg.norm(numgrad + grad)

    print(diff)
    print("\n")
    print('If your backpropagation implementation is correct, then \n'
          'the relative difference will be small (less than 1e-9). \n'
          '\nRelative Difference: {:.10f}'.format(diff))

The checking function generates its own data using the debugInitializeWeights function (so this is a reproducible example; just run it and it will call the other functions), and then calls the function that computes the gradient using finite differences. Both are below:

def debugInitializeWeights(fan_out, fan_in):
    """
    Initializes the weights of a layer with fan_in incoming connections and fan_out
    outgoing connections using a fixed strategy.

    Input: fan_out, number of outgoing connections for a layer as int; fan_in, number
    of incoming connections for the same layer as int.

    Output: Weight matrix, W, of size (fan_out, 1 + fan_in); the first column of W
    handles the "bias" terms.
    """
    W = np.zeros((fan_out, 1 + fan_in))
    # Initialize W using "sin"; this ensures that the values in W are of similar scale,
    # which is useful for debugging
    W = np.sin(range(1, np.size(W) + 1)) / 10
    return W.reshape(fan_out, fan_in + 1)


def computeNumericalGradient(J, nn_params):
    """
    Computes the gradient using "finite differences" and provides a numerical estimate
    of the gradient (i.e., the gradient of the function J around theta).
    Credit: Based on the MATLAB code provided by Dr. Andrew Ng, Stanford Univ.

    Inputs: Cost function, J, as computed by the nnCost function; parameter vector, nn_params.

    Output: Gradient vector using finite differences. Per Dr. Ng,
    'Sets numgrad(i) to (a numerical approximation of) the partial derivative of
    J with respect to the i-th input argument, evaluated at theta. (i.e., numgrad(i) should
    be the (approximately) the partial derivative of J with respect to theta(i).)'
    """
    numgrad = np.zeros(nn_params.shape)
    perturb = np.zeros(nn_params.shape)
    e = .0001

    for i in range(np.size(nn_params)):
        # Set perturbation (i.e., noise) vector
        perturb[i] = e
        # run cost fxn w/ noise added to and subtracted from the parameters in nn_params
        cost1, grad1 = J(nn_params - perturb)
        cost2, grad2 = J(nn_params + perturb)
        # record the difference in cost function outputs; this is the numerical gradient
        numgrad[i] = (cost2 - cost1) / (2 * e)
        perturb[i] = 0

    return numgrad
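Not from the original post, but one way to narrow down where the disagreement comes from is to point computeNumericalGradient at a function whose gradient is known in closed form; if it matches there, the problem is more likely in the backprop code than in the checker. A minimal sketch, assuming the test function returns a (cost, grad) tuple the way nnCost does:

import numpy as np

def quadratic_cost(theta):
    # cost theta·theta with exact analytical gradient 2*theta
    return theta @ theta, 2 * theta

theta0 = np.linspace(-1.0, 1.0, 7)                      # arbitrary test point
numgrad = computeNumericalGradient(quadratic_cost, theta0)
_, exact = quadratic_cost(theta0)

# central differences are exact for a quadratic up to round-off, so this should be tiny
print(np.linalg.norm(numgrad - exact) / np.linalg.norm(numgrad + exact))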

This code is not for a class. That MOOC was in MATLAB and is over now; this is for myself. Other solutions exist online, but looking at them hasn't helped much, since everyone takes a different (and hard-to-follow) approach. So, I'm in serious need of help or a miracle.

Edit/Update: Using Fortran ordering when unrolling the vectors affects the outcome, but I have not been able to get the gradients to agree by changing that option.
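For what it's worth (my own illustration, not from the post), the C-versus-Fortran question only matters if the order used to unroll the parameters differs from the order used to reshape them back; when ravel and reshape agree, the weight matrices are reassembled intact:

import numpy as np

Theta = np.arange(12).reshape(3, 4)

flat_c = Theta.ravel(order='C')              # row-major unroll (numpy default)
flat_f = Theta.ravel(order='F')              # column-major unroll (MATLAB-style)

back_ok  = flat_c.reshape(3, 4, order='C')   # same order both ways: recovers Theta
back_bad = flat_f.reshape(3, 4, order='C')   # mixed orders: scrambled weights

print(np.array_equal(back_ok, Theta))        # True
print(np.array_equal(back_bad, Theta))       # False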


Answer:

One thought: I think your perturbation is a little large at 1e-4. For double-precision floats it should be more like 1e-8, i.e. the square root of machine epsilon (or are you working in single precision?).
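For reference (my addition, not the answerer's), both values can be computed directly from numpy:

import numpy as np

print(np.sqrt(np.finfo(np.float64).eps))     # ~1.49e-8 for double precision
print(np.sqrt(np.finfo(np.float32).eps))     # ~3.45e-4 for single precision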

Having said that, finite differences can be a very poor approximation to the true derivative. Specifically, floating-point computation in numpy is not deterministic, as you seem to have found out. The noise in an evaluation can cancel out many significant digits under some circumstances. What values are you seeing and what are you expecting?
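A quick experiment (mine, not part of the original answer) showing this tradeoff on a simple scalar function with a known derivative: the central-difference error first shrinks with the step size, then grows again once round-off cancellation dominates.

import numpy as np

f = np.exp          # derivative of exp is exp, so the exact answer is known
x = 1.0
exact = np.exp(x)

for h in [1e-2, 1e-4, 1e-6, 1e-8, 1e-10, 1e-12]:
    approx = (f(x + h) - f(x - h)) / (2 * h)
    print(f"h = {h:.0e}   relative error = {abs(approx - exact) / exact:.2e}")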
