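I have installed Theano (TH), Tensorflow (TF), and Keras. Basic tests show that they all work with the GPU (GTX 1070), Cuda 8.0, and cuDNN 5.1.
When I run the Keras example cifar10_cnn.py with TH as the backend, it seems to work fine and each epoch takes about 18 seconds. When I run it with TF as the backend, almost every time (it occasionally works, but I cannot reproduce that) the optimization stalls after every epoch with an accuracy of 0.1. It looks as if the weights are not being updated.
That is a shame, because with TF as the backend each epoch takes only about 10 seconds (even on the rare runs that succeed). I am using Conda, and I am very new to Python. In case it helps, "conda list" shows two versions of some packages.
If you have any clue, please let me know. Thanks. Here is the output: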
python cifar10_cnn.py
Using TensorFlow backend.
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcurand.so locally
X_train shape: (50000, 32, 32, 3)
50000 train samples
10000 test samples
Using real-time data augmentation.
Epoch 1/200
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
name: GeForce GTX 1070
major: 6 minor: 1 memoryClockRate (GHz) 1.7845
pciBusID 0000:01:00.0
Total memory: 7.92GiB
Free memory: 7.60GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1070, pci bus id: 0000:01:00.0)
50000/50000 [==============================] - 11s - loss: 2.3029 - acc: 0.0999 - val_loss: 2.3026 - val_acc: 0.1000
Epoch 2/200
50000/50000 [==============================] - 10s - loss: 2.3028 - acc: 0.0980 - val_loss: 2.3026 - val_acc: 0.1000
Epoch 3/200
50000/50000 [==============================] - 10s - loss: 2.3028 - acc: 0.0992 - val_loss: 2.3026 - val_acc: 0.1000
Epoch 4/200
50000/50000 [==============================] - 10s - loss: 2.3028 - acc: 0.0980 - val_loss: 2.3026 - val_acc: 0.1000
Epoch 5/200
13184/50000 [======>.......................] - ETA: 7s - loss: 2.3026 - acc: 0.1044^C
Traceback (most recent call last):
Answer:
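To me this looks like random guessing: with 10 possible classes, it is right 10% of the time. The only cause I can think of is that your learning rate may be a bit too high. I have seen models that sometimes converge and sometimes do not when the learning rate is high. As for the backends, I believe Theano currently performs more optimizations, so that may be affecting things slightly. Try reducing the learning rate by a factor of 10 and see whether it converges.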
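
To make both points concrete, here is a small sketch in plain Python. The first part checks that the logged loss matches what uniform guessing over 10 classes would give, assuming the loss is categorical cross-entropy. The second part is only a toy: it runs gradient descent on a one-dimensional quadratic, not on the CIFAR-10 model, and the names gradient_descent, high_lr, low_lr and the curvature value are made up for illustration. It shows how a step size that overshoots can be fixed by the tenfold reduction suggested above.

```python
import math

# Part 1: chance-level cross-entropy for a balanced 10-class problem.
# Predicting probability 1/10 for every class gives loss -ln(1/10) = ln(10).
num_classes = 10
chance_loss = math.log(num_classes)
chance_acc = 1.0 / num_classes

# Values copied from the training log in the question.
logged_val_loss = 2.3026
logged_val_acc = 0.1000

stuck_at_chance = (
    abs(logged_val_loss - chance_loss) < 1e-3
    and abs(logged_val_acc - chance_acc) < 1e-3
)
print("chance loss:", round(chance_loss, 4), "stuck at chance:", stuck_at_chance)


# Part 2: toy illustration of learning-rate sensitivity (hypothetical setup).
def gradient_descent(lr, steps=50, w0=1.0, curvature=10.0):
    """Minimise f(w) = 0.5 * curvature * w**2 with plain gradient descent."""
    w = w0
    for _ in range(steps):
        grad = curvature * w   # derivative of f at w
        w = w - lr * grad      # each step multiplies w by (1 - lr * curvature)
    return w


high_lr = 0.25          # lr * curvature = 2.5 > 2, so every step overshoots and |w| grows
low_lr = high_lr / 10   # tenfold reduction: |1 - 0.25| < 1, so w shrinks toward 0

print("high lr final |w|:", abs(gradient_descent(high_lr)))
print("low lr final |w|:", abs(gradient_descent(low_lr)))
```

In the Keras script itself, the equivalent change would be to divide the learning-rate argument of whichever optimizer the script constructs by 10 and rerun with the TF backend.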