Hyperparameter tuning with ML Engine: NaN error when running parallel trials

In a Google ML Engine hyperparameter tuning job, certain training configurations lead to a NaN loss, which raises an error. I would like to simply ignore those trials and keep tuning with different parameters.

I am using NanTensorHook with fail_on_nan_loss=False. It works fine on ML Engine when trials are not run in parallel (maxParallelTrials: 1), but fails when several trials run in parallel (maxParallelTrials: 3).

Has anyone run into this error before? Is there a workaround?

Here is my configuration file:

trainingInput:
  scaleTier: CUSTOM
  masterType: standard
  workerType: standard
  parameterServerType: standard
  workerCount: 4
  parameterServerCount: 1
  hyperparameters:
    goal: MAXIMIZE
    maxTrials: 5
    maxParallelTrials: 3
    enableTrialEarlyStopping: False
    hyperparameterMetricTag: auc
    params:
    - parameterName: learning_rate
      type: DOUBLE
      minValue: 0.0001
      maxValue: 0.01
      scaleType: UNIT_LOG_SCALE
    - parameterName: optimizer
      type: CATEGORICAL
      categoricalValues:
      - Adam
      - Adagrad
      - Momentum
      - SGD
    - parameterName: batch_size
      type: DISCRETE
      discreteValues:
      - 128
      - 256
      - 512

Here is how I set up the NanTensorHook:

hook = tf.train.NanTensorHook(loss, fail_on_nan_loss=False)
train_op = tf.contrib.layers.optimize_loss(
    loss=loss, global_step=tf.train.get_global_step(),
    learning_rate=lr, optimizer=optimizer)
model_fn = tf.estimator.EstimatorSpec(
    mode=mode, loss=loss,
    eval_metric_ops=eval_metric_ops, train_op=train_op,
    training_hooks=[hook])
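For context, this snippet is assumed to live inside a model_fn that is handed to an Estimator and run with tf.estimator.train_and_evaluate, which matches the train_and_evaluate frames in the traceback below. The following is only a minimal sketch of that assumed wiring; build_model, train_input_fn and eval_input_fn are hypothetical placeholders, not code from the original post.

# Illustrative sketch of the assumed surrounding code; helper names are hypothetical.
import tensorflow as tf

def my_model_fn(features, labels, mode, params):
    # build_model is a hypothetical placeholder for the network/loss construction.
    loss, eval_metric_ops = build_model(features, labels, params)
    # With fail_on_nan_loss=False the hook logs a warning and requests a stop
    # instead of raising NanLossDuringTrainingError.
    hook = tf.train.NanTensorHook(loss, fail_on_nan_loss=False)
    train_op = tf.contrib.layers.optimize_loss(
        loss=loss, global_step=tf.train.get_global_step(),
        learning_rate=params["learning_rate"], optimizer=params["optimizer"])
    return tf.estimator.EstimatorSpec(
        mode=mode, loss=loss, eval_metric_ops=eval_metric_ops,
        train_op=train_op, training_hooks=[hook])

estimator = tf.estimator.Estimator(
    model_fn=my_model_fn,
    params={"learning_rate": 0.001, "optimizer": "Adam", "batch_size": 128})
# train_input_fn / eval_input_fn are hypothetical input functions.
tf.estimator.train_and_evaluate(
    estimator,
    tf.estimator.TrainSpec(input_fn=train_input_fn),
    tf.estimator.EvalSpec(input_fn=eval_input_fn))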

This is the error message I get:

Hyperparameter Tuning Trial #4 Failed before any other successful trials were completed. The failed trial had parameters: optimizer=SGD, batch_size=128, learning_rate=0.00075073617775056709, . The trial's error message was:

The replica worker 1 exited with a non-zero status of 1. Termination reason: Error.
Traceback (most recent call last):
  [...]
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/training.py", line 421, in train_and_evaluate
    executor.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/training.py", line 522, in run
    getattr(self, task_to_run)()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/training.py", line 532, in run_worker
    return self._start_distributed_training()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/training.py", line 715, in _start_distributed_training
    saving_listeners=saving_listeners)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 352, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 891, in _train_model
    _, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 546, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1022, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1113, in run
    raise six.reraise(*original_exc_info)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1098, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1178, in run
    run_metadata=run_metadata))
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/basic_session_run_hooks.py", line 617, in after_run
    raise NanLossDuringTrainingError
NanLossDuringTrainingError: NaN loss during training.

The replica worker 3 exited with a non-zero status of 1. Termination reason: Error.
Traceback (most recent call last):
  [...]
  (traceback identical to the one for replica worker 1 above)
NanLossDuringTrainingError: NaN loss during training.

Thanks in advance!


Answer:

Different trials of a hyperparameter tuning job are isolated from each other at runtime, so a hook added in one trial is not affected by hooks running in other trials.

I suspect the problem is caused by that trial's particular combination of hyperparameters. To confirm this, I would suggest running a regular (non-tuning) training job with the hyperparameter values of the failed trial and checking whether the error occurs again.
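For example, a minimal sketch of that check, assuming the existing trainer code exposes a model_fn and a training input function (trainer.model, train_input_fn and the model_dir below are hypothetical names, not taken from the original post):

# Hypothetical reproduction of the failed trial outside hyperparameter tuning.
import tensorflow as tf
# trainer.model, model_fn and train_input_fn are assumed names for the existing
# trainer package; adjust them to the actual code layout.
from trainer.model import model_fn, train_input_fn

# Exact parameter values reported for the failed trial.
failed_trial_params = {
    "learning_rate": 0.00075073617775056709,
    "optimizer": "SGD",
    "batch_size": 128,
}

estimator = tf.estimator.Estimator(
    model_fn=model_fn,
    model_dir="/tmp/nan_repro",
    params=failed_trial_params)

# A single plain training run: if the loss also turns to NaN here, the problem
# is the parameter combination itself rather than the parallel-trial setup.
estimator.train(
    input_fn=lambda: train_input_fn(failed_trial_params["batch_size"]),
    max_steps=1000)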

Could you send your project number and job ID to [email protected] so that we can investigate further?
