I'm building a basic 3-layer neural network in Python. After writing the gradient function, I went on to gradient-check it against a numerical gradient. Getting a large relative difference, I unrolled both gradient matrices and compared them side by side.
Function Gradient     Numerical Gradient
-0.000968788380809     0.0
 0.0153540197907       0.0153540197889
-0.00584391679274     -0.00584391679048
-0.00490359558077     -0.00490359558514
-0.00171892592537     -0.0017189259216
 0.00913024106334      0.00913024106319
-0.0182154767069      -0.0182154767092
 0.0152611324409       0.01526113244
-0.00373505297372     -0.00373505297135
-0.00513225994728     -0.00513225994814
-0.00531954399401     -0.00531954399641
-0.0185748801227      -0.0185748801163
 0.00745186105851      0.00745186105267
 0.0134566626927       0.0134566626908
 0.0251548691426       0.0251548691388
 0.00609388350562      0.00609388350226
-0.00471176815719     -0.00471176815564
 0.0113580721225       0.0113580721228
 0.00465172663488      0.00465172663944
-0.0221326283708      -0.02213262837
 0.300007655583       -0.300007655583    <- diverges, corresponding to theta2
 0.155638694282       -0.15345321819
 0.147747817305       -0.149026829224
 0.150703152382       -0.172330417252
 0.156307235611       -0.116975643856
 0.136898763375       -0.170081036297
 0.0621121242042      -0.0621121242372
 0.0442762464937      -0.0187338352431
 0.0489123689979      -0.00938236375481
 0.0244392582651      -0.0465061209964
 0.0237741996575      -0.028319115235
 0.0313594790974      -0.0330473942922
 0.106306327946       -0.106306327941
 0.0348751481828      -0.0704775747806
 0.0303373211657      -0.0756744476749
 0.0633094699759      -0.0461971224763
 0.0524239030728      -0.0477244101571
 0.0633274024777      -0.0397657392082

Relative Difference: 6.61473694017
The first 20 elements of each list correspond to the gradients for the first weight matrix, and the remaining 18 to the gradients for the second weight matrix. From what I can tell, the error occurs in those last 18 elements (i.e. the gradients for the theta2 matrix), where the function gradient starts to diverge from the "correct" numerical gradient. This also causes scipy.optimize.fmin_cg to give me the following warning:
Warning: Desired error not necessarily achieved due to precision loss.
Any help would be greatly appreciated! Here is the relevant code:
import numpy as np

def sigmoid(z):
    return 1 / (1+np.exp(z))

def sigmoid_gradient(z):
    return sigmoid(z)*(1-sigmoid(z))

def randInitializeWeights(layer_in, layer_out):
    matrix = np.zeros((layer_out, 1 + layer_in))
    epsilon_init = 0.12
    matrix = np.random.rand(layer_out, 1+layer_in) * 2 * epsilon_init - epsilon_init
    return matrix

def gradient(theta, *args):
    X, y, num_inputs, num_hidden_units, num_labels, lamb = args
    m = len(X)
    theta1 = np.reshape(theta[0:(num_hidden_units*(num_inputs+1))],
                        (num_hidden_units, (num_inputs+1)))
    theta2 = np.reshape(theta[(num_hidden_units*(num_inputs+1)):],
                        (num_labels, num_hidden_units+1))
    theta1_grad = np.zeros(theta1.shape)
    theta2_grad = np.zeros(theta2.shape)
    delta1 = np.zeros(theta1.shape)
    delta2 = np.zeros(theta2.shape)

    for t in range(0, m):
        vec_y = np.zeros(num_labels)
        vec_y[y[t]] = 1
        vec_y = vec_y[:, np.newaxis]

        # feedforward to compute all the neuron activations
        a_1 = np.r_[[1], X[t]]
        a_1 = a_1[:, np.newaxis]
        z_2 = np.dot(theta1, a_1)
        a_2 = np.vstack([1, sigmoid(z_2)])
        z_3 = np.dot(theta2, a_2)
        a_3 = sigmoid(z_3)

        # error for output nodes
        del3 = a_3 - vec_y
        # error for hidden nodes
        del2 = np.multiply(np.dot(theta2.T, del3), sigmoid_gradient(np.vstack([1, z_2])))
        # remove bias unit
        del2 = del2[1:]

        # accumulate gradient
        delta1 = delta1 + del2*a_1.T
        delta2 = delta2 + del3*a_2.T

    # no need to regularize the first column
    theta1_grad[:, 0] = (1/m)*delta1[:, 0]
    theta2_grad[:, 0] = (1/m)*delta2[:, 0]
    # regularize the rest
    theta1_grad[:, 1:] = ((1/m) * delta1[:, 1:]) + (lamb/m)*theta1[:, 1:]
    theta2_grad[:, 1:] = ((1/m) * delta2[:, 1:]) + (lamb/m)*theta2[:, 1:]

    # unroll
    grad = np.hstack([theta1_grad.ravel(), theta2_grad.ravel()])
    return grad

def gradientChecking(lamb):
    input_layer_size = 3
    hidden_layer_size = 5
    num_labels = 3
    m = 5
    theta1 = randInitializeWeights(input_layer_size, hidden_layer_size)
    theta2 = randInitializeWeights(hidden_layer_size, num_labels)
    X = np.random.rand(m, input_layer_size)
    y = np.array([1, 2, 0, 1, 2])
    nn_params = np.hstack([theta1.ravel(), theta2.ravel()])

    # calculate gradient with function
    grad = gradient(nn_params, X, y, input_layer_size, hidden_layer_size, num_labels, lamb)
    # calculate numerical gradient
    num_grad = computeNumericalGradient(
        lambda theta: computeCost(theta, X, y, input_layer_size,
                                  hidden_layer_size, num_labels, lamb),
        nn_params)

    print('Function Gradient', 'Numerical Gradient')
    for i in range(len(grad)):
        print(grad[i], num_grad[i])

    diff = np.linalg.norm(num_grad-grad)/np.linalg.norm(num_grad+grad)
    print('Relative Difference: ')
    print(diff)

def computeNumericalGradient(J, theta):
    numgrad = np.zeros(theta.shape)
    perturb = np.zeros(theta.shape)
    e = 0.0001
    for p in range(1, np.size(theta)):
        perturb[p] = e
        loss1 = J(theta - perturb)
        loss2 = J(theta + perturb)
        numgrad[p] = (loss2 - loss1) / (2*e)
        perturb[p] = 0
    return numgrad
Answer:
There is a bug in your sigmoid function. It should be:
def sigmoid(z):
    return 1 / (1+np.exp(-z))
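For what it's worth, 1 / (1+np.exp(z)) computes sigmoid(-z), i.e. 1 - sigmoid(z), which is consistent with the flipped signs you see in the theta2 column. A quick check of that identity (a minimal illustration, assuming only numpy):

import numpy as np

z = np.array([-2.0, 0.0, 2.0])
buggy = 1 / (1 + np.exp(z))     # what the original sigmoid computes
fixed = 1 / (1 + np.exp(-z))    # the corrected sigmoid

print(buggy)                          # [0.88079708 0.5        0.11920292]
print(fixed)                          # [0.11920292 0.5        0.88079708]
print(np.allclose(buggy, 1 - fixed))  # True: the buggy version is sigmoid(-z)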
I'm a little confused by your implementation of the backpropagation algorithm. I would implement it without the for loop.
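A minimal sketch of such a loop-free version, assuming the corrected sigmoid above and the same unrolled parameter layout as the question's gradient() (gradient_vectorized is just an illustrative name):

import numpy as np

def gradient_vectorized(theta, X, y, num_inputs, num_hidden_units, num_labels, lamb):
    # Sketch: same parameter layout as the question's gradient(), all examples at once.
    m = X.shape[0]
    theta1 = theta[:num_hidden_units * (num_inputs + 1)].reshape(num_hidden_units, num_inputs + 1)
    theta2 = theta[num_hidden_units * (num_inputs + 1):].reshape(num_labels, num_hidden_units + 1)

    Y = np.eye(num_labels)[y]                          # one-hot labels, m x num_labels

    # feedforward for all m examples (rows are examples)
    a1 = np.hstack([np.ones((m, 1)), X])               # m x (num_inputs+1)
    z2 = a1.dot(theta1.T)                              # m x num_hidden_units
    a2 = np.hstack([np.ones((m, 1)), sigmoid(z2)])     # m x (num_hidden_units+1)
    a3 = sigmoid(a2.dot(theta2.T))                     # m x num_labels

    # backpropagate the errors
    d3 = a3 - Y                                        # m x num_labels
    d2 = d3.dot(theta2)[:, 1:] * sigmoid_gradient(z2)  # drop the bias column

    # average over examples and regularize everything except the bias column
    theta1_grad = d2.T.dot(a1) / m
    theta2_grad = d3.T.dot(a2) / m
    theta1_grad[:, 1:] += (lamb / m) * theta1[:, 1:]
    theta2_grad[:, 1:] += (lamb / m) * theta2[:, 1:]

    return np.hstack([theta1_grad.ravel(), theta2_grad.ravel()])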
You didn't post your computeCost function, so I wrote one myself and checked the gradient. In my case the two columns are equal:
('Function Gradient', 'Numerical Gradient')
(-0.0087363416123043425, 0.0)
(0.017468375248392107, 0.0174683752529603)
(-0.0016267134050363559, -0.0016267134039793518)
(0.0018882373947080224, 0.0018882373997719526)
(-0.0063531428795779391, -0.0063531428762253483)
(0.0029882213493977773, 0.0029882213481435826)
(0.014295787205089885, 0.014295787205131916)
(-0.026668095974979808, -0.026668095973736428)
(0.0043373799514851595, 0.0043373799440971084)
(0.0063740837472641377, 0.0063740837497050506)
(0.0027102260448642525, 0.0027102260435896142)
(0.0067009063282609839, 0.0067009063298151261)
(-0.0029645476578591843, -0.0029645476562478734)
(-0.012000477453137556, -0.012000477451756808)
(-0.020065071389262716, -0.020065071393293721)
(0.010308693441913186, 0.010308693438876304)
(-0.0015996484140612609, -0.0015996484115099463)
(-0.0086037766244218914, -0.0086037766244828617)
(-0.0099431361329477934, -0.0099431361344493041)
(0.0062574996404342166, 0.0062574996406716821)
(0.30213488769328123, 0.3021348876908192)
(0.14900524972537924, 0.14900524972549789)
(0.13305168538400619, 0.13305168538479961)
(0.16730920742910549, 0.16730920743279754)
(0.14245586995768528, 0.14245586995365045)
(0.15465244296463604, 0.15465244296519742)
(0.10813908901043021, 0.10813908900342284)
(0.040844058224880242, 0.04084405822446513)
(0.040566215206120269, 0.040566215204762557)
(0.036451467449020114, 0.036451467448905817)
(0.065664340475228455, 0.065664340476168093)
(0.070753692265581092, 0.07075369226283712)
(0.088651862157018618, 0.088651862166777562)
(0.028272897964677978, 0.028272897965031518)
(0.026876928049457398, 0.026876928049812676)
(0.056512225949437798, 0.056512225949933992)
(0.051775047342360533, 0.051775047342772496)
(0.025689087137289929, 0.025689087135294386)

Relative Difference: 0.00878484310135
Here is my code:
...
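The answer's full code isn't reproduced above. As a minimal sketch of the missing piece, a computeCost along the lines of the usual regularized cross-entropy cost, matching the call signature used in gradientChecking (an illustration only, not the answerer's actual implementation):

def computeCost(theta, X, y, num_inputs, num_hidden_units, num_labels, lamb):
    # Sketch of a regularized cross-entropy cost matching the question's call signature.
    m = X.shape[0]
    theta1 = theta[:num_hidden_units * (num_inputs + 1)].reshape(num_hidden_units, num_inputs + 1)
    theta2 = theta[num_hidden_units * (num_inputs + 1):].reshape(num_labels, num_hidden_units + 1)

    Y = np.eye(num_labels)[y]                          # one-hot labels, m x num_labels

    # feedforward for all examples
    a1 = np.hstack([np.ones((m, 1)), X])
    a2 = np.hstack([np.ones((m, 1)), sigmoid(a1.dot(theta1.T))])
    a3 = sigmoid(a2.dot(theta2.T))                     # hypotheses, m x num_labels

    # cross-entropy term plus L2 penalty on the non-bias weights
    cost = -np.sum(Y * np.log(a3) + (1 - Y) * np.log(1 - a3)) / m
    reg = (lamb / (2.0 * m)) * (np.sum(theta1[:, 1:] ** 2) + np.sum(theta2[:, 1:] ** 2))
    return cost + reg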