I found a simple neural network with the layers w1, ReLU, and w2. I tried adding a new weight layer in the middle, followed by a second ReLU, so the layer structure is now w1, ReLU, w_mid, ReLU, and w2.
If it works at all, it is much slower than the original three-layer network. I am not sure whether every part gets a forward pass and whether backpropagation works across all the parts it should.
The network comes from this link; it is the third code block on the page.
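For reference, the stack I am after is roughly the following, written with torch.nn modules just to illustrate the structure (my actual code below works with raw tensors and bias-free matrix multiplies instead of nn layers):

import torch.nn as nn

# Intended structure: w1 -> ReLU -> w_mid -> ReLU -> w2
# (dimensions match the code below)
N, D_in, H, D_out = 64, 250, 250, 10
model = nn.Sequential(
    nn.Linear(D_in, H, bias=False),   # w1
    nn.ReLU(),
    nn.Linear(H, H, bias=False),      # w_mid
    nn.ReLU(),
    nn.Linear(H, D_out, bias=False),  # w2
)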
The code I modified is below. Below that is the original code.
import torch

dtype = torch.float
device = torch.device("cpu")
#device = torch.device("cuda:0") # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 250, 250, 10

# Create random input and output data
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

# Randomly initialize weights
w1 = torch.randn(D_in, H, device=device, dtype=dtype)
w_mid = torch.randn(H, H, device=device, dtype=dtype)
w2 = torch.randn(H, D_out, device=device, dtype=dtype)

learning_rate = 1e-5

for t in range(5000):
    # Forward pass: compute predicted y
    h = x.mm(w1)
    h_relu = h.clamp(min=0)
    k = h_relu.mm(w_mid)
    k_relu = k.clamp(min=0)
    y_pred = k_relu.mm(w2)

    # Compute and print loss
    loss = (y_pred - y).pow(2).sum().item()
    if t % 1000 == 0:
        print(t, loss)

    # Backprop to compute gradients of w1, w_mid, and w2 with respect to loss
    grad_y_pred = (y_pred - y) * 2
    grad_w2 = k_relu.t().mm(grad_y_pred)
    grad_k_relu = grad_y_pred.mm(w2.t())
    grad_k = grad_k_relu.clone()
    grad_k[k < 0] = 0
    grad_mid = h_relu.t().mm(grad_k)
    grad_h_relu = grad_k.mm(w1.t())
    grad_h = grad_h_relu.clone()
    grad_h[h < 0] = 0
    grad_w1 = x.t().mm(grad_h)

    # Update weights
    w1 -= learning_rate * grad_w1
    w_mid -= learning_rate * grad_mid
    w2 -= learning_rate * grad_w2
The loss values are:
0 1904074240.0
1000 639.4848022460938
2000 639.4848022460938
3000 639.4848022460938
4000 639.4848022460938
Here is the original code from the PyTorch website.
import torch

dtype = torch.float
#device = torch.device("cpu")
device = torch.device("cuda:0") # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random input and output data
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

# Randomly initialize weights
w1 = torch.randn(D_in, H, device=device, dtype=dtype)
w2 = torch.randn(H, D_out, device=device, dtype=dtype)

learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y
    h = x.mm(w1)
    h_relu = h.clamp(min=0)
    y_pred = h_relu.mm(w2)

    # Compute and print loss
    loss = (y_pred - y).pow(2).sum().item()
    if t % 100 == 99:
        print(t, loss)

    # Backprop to compute gradients of w1 and w2 with respect to loss
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.t().mm(grad_y_pred)
    grad_h_relu = grad_y_pred.mm(w2.t())
    grad_h = grad_h_relu.clone()
    grad_h[h < 0] = 0
    grad_w1 = x.t().mm(grad_h)

    # Update weights using gradient descent
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2
Answer:
The gradient of h_relu is computed incorrectly:
grad_h_relu = grad_k.mm(w1.t())
It should use w_mid instead of w1:
grad_h_relu = grad_k.mm(w_mid.t())
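To make the chain explicit, here is how the corrected backward pass reads with your variable names (a sketch of just the backprop section, not the whole loop):

# Inside the training loop, after the forward pass:
grad_y_pred = (y_pred - y) * 2
grad_w2 = k_relu.t().mm(grad_y_pred)      # gradient for the output weights

grad_k_relu = grad_y_pred.mm(w2.t())      # back through w2
grad_k = grad_k_relu.clone()
grad_k[k < 0] = 0                         # second ReLU: zero where k was clipped
grad_mid = h_relu.t().mm(grad_k)          # gradient for the middle weights

grad_h_relu = grad_k.mm(w_mid.t())        # back through w_mid (not w1)
grad_h = grad_h_relu.clone()
grad_h[h < 0] = 0                         # first ReLU: zero where h was clipped
grad_w1 = x.t().mm(grad_h)                # gradient for the input weights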
Apart from that, the computation is correct, but you should lower the learning rate. The initial gradients are very large, which makes the weights very large, so the values overflow to infinity and then produce NaN loss and gradients. This is known as exploding gradients.
In your example, a learning rate of 1e-8 seems to work.
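Putting the two fixes together, and assuming everything else in your post stays the same, only these two lines need to change:

learning_rate = 1e-8                  # was 1e-5; otherwise the large initial gradients blow up the weights

# ...

grad_h_relu = grad_k.mm(w_mid.t())    # was grad_k.mm(w1.t())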