TensorFlow: CUDA_ERROR_INVALID_DEVICE when starting a new session with TORQUE and GPUs

I am trying to solve a problem on our cluster that occurs when using TensorFlow v1.0.1 with GPUs together with TORQUE v6.1.0 and MOAB as the job scheduler.

The error occurs when the executed Python script tries to start a new session:

    [...]
    with tf.Session() as sess:
    [...]

The error message is:

    I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.8.0 locally
    I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.5 locally
    I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcufft.so.8.0 locally
    I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcuda.so.1 locally
    I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcurand.so.8.0 locally
    E tensorflow/core/common_runtime/direct_session.cc:137] Internal: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE
    Load Data...
    input: (12956, 128, 128, 1)
    output: (12956, 64, 64, 16)
    Initiliaze training
    Traceback (most recent call last):
      File "[...]/train.py", line 154, in <module>
        tf.app.run()
      File "[...]/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 44, in run
        _sys.exit(main(_sys.argv[:1] + flags_passthrough))
      File "[...]/train.py", line 150, in main
        training()
      File "[...]/train.py", line 72, in training
        with tf.Session() as sess:
      File "[...]/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1176, in __init__
        super(Session, self).__init__(target, graph, config=config)
      File "[...]/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 552, in __init__
        self._session = tf_session.TF_NewDeprecatedSession(opts, status)
      File "[...]/python/3.5.1/lib/python3.5/contextlib.py", line 66, in __exit__
        next(self.gen)
      File "[...]/lib/python3.5/site-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
        pywrap_tensorflow.TF_GetCode(status))
    tensorflow.python.framework.errors_impl.InternalError: Failed to create session.

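To reproduce the problem, I ran the script directly on an offline GPU node, with no TORQUE involved, and it did not throw any errors. I therefore assume the problem is related to TORQUE, but I have not found a solution yet.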

The TORQUE parameters are:

    #PBS -l nodes=1:ppn=2:gpus=4:exclusive_process
    #PBS -l mem=25gb

I tried once without exclusive_process, but the job was not executed. I think our scheduler requires this flag when GPUs are involved.
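Since the script works outside TORQUE and fails inside it, comparing what each environment exposes to the process may help narrow things down. The following is only a diagnostic sketch, not a fix. It assumes that TORQUE sets the `PBS_GPUFILE` variable to a file whose lines look like `<hostname>-gpu<N>`, and that `nvidia-smi` is on the `PATH`; the function names `parse_gpufile` and `describe_gpu_environment` are made up for illustration.

```python
import os
import shutil
import subprocess


def parse_gpufile(text):
    """Return GPU indices from lines such as 'node01-gpu2' (assumed format)."""
    indices = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Keep whatever follows the last '-gpu' marker, if it is a number.
        _, sep, tail = line.rpartition("-gpu")
        if sep and tail.isdigit():
            indices.append(int(tail))
    return indices


def describe_gpu_environment(environ=os.environ):
    """Collect what the current process can see about its GPU allocation."""
    report = {"CUDA_VISIBLE_DEVICES": environ.get("CUDA_VISIBLE_DEVICES")}
    gpufile = environ.get("PBS_GPUFILE")  # assumption: set by TORQUE for GPU jobs
    report["PBS_GPUFILE"] = gpufile
    if gpufile and os.path.isfile(gpufile):
        with open(gpufile) as handle:
            report["allocated_gpus"] = parse_gpufile(handle.read())
    if shutil.which("nvidia-smi"):
        # Ask the driver for the compute mode of each GPU.
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=index,compute_mode",
             "--format=csv,noheader"],
            stdout=subprocess.PIPE, universal_newlines=True)
        report["compute_modes"] = result.stdout.strip().splitlines()
    return report


if __name__ == "__main__":
    for key, value in describe_gpu_environment().items():
        print(key, "=", value)
```

Running this once from an interactive shell on the node and once from inside a submitted job, then comparing the two reports, would show whether the visible devices or the compute mode differ between the two cases.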


Answer:

I think I found a way to get the job running: changing the compute mode from 'exclusive_process' to 'shared'.

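The job now starts and appears to be computing. However, I wonder whether all four GPUs are actually being used, because the nvidia-smi output shows every GPU handling the same process. Why is that?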

    Fri May 26 13:41:33 2017
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 375.26                 Driver Version: 375.26                    |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  Tesla K80           On   | 0000:04:00.0     Off |                    0 |
    | N/A   45C    P0    58W / 149W |  10871MiB / 11439MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
    |   1  Tesla K80           On   | 0000:05:00.0     Off |                    0 |
    | N/A   37C    P0    70W / 149W |  10873MiB / 11439MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
    |   2  Tesla K80           On   | 0000:84:00.0     Off |                    0 |
    | N/A   32C    P0    59W / 149W |  10871MiB / 11439MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
    |   3  Tesla K80           On   | 0000:85:00.0     Off |                    0 |
    | N/A   58C    P0   143W / 149W |  11000MiB / 11439MiB |     95%      Default |
    +-------------------------------+----------------------+----------------------+

    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID  Type  Process name                               Usage      |
    |=============================================================================|
    |    0     11757    C   python                                       10867MiB |
    |    1     11757    C   python                                       10869MiB |
    |    2     11757    C   python                                       10867MiB |
    |    3     11757    C   python                                       10996MiB |
    +-----------------------------------------------------------------------------+
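A likely explanation for this output is TensorFlow's documented default behaviour: it maps nearly all memory of every GPU visible to the process, subject to `CUDA_VISIBLE_DEVICES`, even when the graph only runs on one device. That would match one PID holding about 10.8 GiB on all four cards while only GPU 3 shows utilization. The sketch below is one way to limit this, assuming TensorFlow 1.x; `restrict_visible_gpus` and `make_session` are hypothetical helper names, and the choice of GPU index is only an example.

```python
import os


def restrict_visible_gpus(indices, environ=os.environ):
    """Expose only the given GPU indices to CUDA.

    Must be called before TensorFlow is imported, because the CUDA
    runtime reads CUDA_VISIBLE_DEVICES when it initializes.
    """
    value = ",".join(str(i) for i in indices)
    environ["CUDA_VISIBLE_DEVICES"] = value
    return value


def make_session():
    """Create a TF 1.x session that allocates GPU memory on demand."""
    import tensorflow as tf  # imported late so the variable above is already set

    config = tf.ConfigProto()
    # Grow the allocation as needed instead of reserving almost all
    # memory on every visible GPU up front.
    config.gpu_options.allow_growth = True
    return tf.Session(config=config)
```

For example, calling `restrict_visible_gpus([3])` at the top of `train.py` and then using `make_session()` in place of `tf.Session()` should leave the other three cards untouched. Spreading the computation over all four GPUs is a separate matter: in TensorFlow 1.x that generally requires placing parts of the graph explicitly with `tf.device`.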
