TensorFlow training is not working: the model fails to learn the data

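I have a dataset with more than 17 million observations that I am trying to use to train a DNNRegressor model. However, training does not work at all. The loss is on the order of 10^15, which is shocking. I have been trying different things for several weeks, and no matter what I do, the loss will not go down.

For example, after training I run a test prediction on an observation that was also in the training data. The expected result is 140944.00, but the prediction comes out as -169532.5, which is absurd. There are not even any negative values in the training data, so I don't understand how it can be this far off.

Here is some sample training data: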

Amount      Contribution    ServiceType     Percentile       Time   Result
214871.00   3501.00         SM23            high             50     17807828.00
214871.00   3501.00         SM23            high             51     19216520.00
214871.00   3501.00         SM23            high             52     19676064.00
214871.00   3501.00         SM23            high             53     21038840.00
214871.00   3501.00         SM23            high             54     22248295.00
214871.00   3501.00         SM23            high             55     22412713.00
28006.00    83.00           SM0             i_low            0      28006.00
28006.00    83.00           SM0             i_low            1      28804.00
28006.00    83.00           SM0             i_low            2      30140.00
28006.00    83.00           SM0             i_low            3      31598.00
28006.00    83.00           SM0             i_low            4      33130.00
28006.00    83.00           SM0             i_low            5      34663.00

Here is my code:

feature_columns = [
    tf.feature_column.numeric_column('Amount', dtype=dtypes.float32),
    tf.feature_column.numeric_column('Contribution', dtype=dtypes.float32),
    tf.feature_column.embedding_column(
        tf.feature_column.categorical_column_with_vocabulary_list(
            'ServiceType',
            [
                'SM0',  'SM1',  'SM2',  'SM3',
                'SM4',  'SM5',  'SM6',  'SM7',
                'SM8',  'SM9',  'SM10', 'SM11',
                'SM12', 'SM13', 'SM14', 'SM15',
                'SM16', 'SM17', 'SM18', 'SM19',
                'SM20', 'SM21', 'SM22', 'SM23'
            ],
            dtype=dtypes.string
        ),
        dimension=16
    ),
    tf.feature_column.embedding_column(
        tf.feature_column.categorical_column_with_vocabulary_list(
            'Percentile',
            ['i_low', 'low', 'mid', 'high'],
            dtype=dtypes.string
        ),
        dimension=16
    ),
    tf.feature_column.numeric_column('Time', dtype=dtypes.int8)
]

model = tf.estimator.DNNRegressor(
    hidden_units=[64, 32],
    feature_columns=feature_columns,
    model_dir=os.getcwd() + "\job",
    label_dimension=1,
    weight_column=None,
    optimizer='Adagrad',
    activation_fn=tf.nn.elu,
    dropout=None,
    input_layer_partitioner=None,
    config=RunConfig(
        master=None,
        num_cores=4,
        log_device_placement=False,
        gpu_memory_fraction=1,
        tf_random_seed=None,
        save_summary_steps=100,
        save_checkpoints_secs=0,
        save_checkpoints_steps=None,
        keep_checkpoint_max=5,
        keep_checkpoint_every_n_hours=10000,
        log_step_count_steps=100,
        evaluation_master='',
        model_dir=os.getcwd() + "\job",
        session_config=None
    )
)

print('Training...')
model.train(input_fn=get_input_fn('train'), steps=100000)

print('Evaluating...')
model.evaluate(input_fn=get_input_fn('test'), steps=4000)

print('Predicting...')
prediction = model.predict(input_fn=get_input_fn('predict'))
print(list(prediction))

The input_fn is built as follows:

def split_input():
    data = pd.read_csv('C:\\all_data.txt', sep='\t')
    x = data.drop('Result', axis=1)
    y = data.Result
    return train_test_split(x, y, test_size=0.2, random_state=123)


def get_input_fn(input_fn_type):
    train_x, test_x, train_y, test_y = split_input()
    if input_fn_type == 'train':
        return tf.estimator.inputs.pandas_input_fn(
            x=train_x,
            y=train_y,
            num_epochs=None,
            shuffle=True
        )
    elif input_fn_type == 'test':
        return tf.estimator.inputs.pandas_input_fn(
            x=test_x,
            y=test_y,
            num_epochs=1,
            shuffle=False
        )
    elif input_fn_type == 'predict':
        return tf.estimator.inputs.pandas_input_fn(
            x=pd.DataFrame(
                {
                    'Amount': 52050.00,
                    'Contribution': 1394.00,
                    'ServiceType': 'SM0',
                    'Percentile': 'i_low',
                    'Time': 5
                },
                index=[0]
            ),
            num_epochs=1,
            shuffle=False
        )

The output is:

Training...
INFO:tensorflow:loss = 6.30944e+15, step = 1
INFO:tensorflow:global_step/sec: 457.091
INFO:tensorflow:loss = 3.28245e+15, step = 101 (0.219 sec)
INFO:tensorflow:global_step/sec: 533.271
INFO:tensorflow:loss = 2.65647e+15, step = 201 (0.188 sec)
INFO:tensorflow:global_step/sec: 533.274
...
INFO:tensorflow:loss = 1.06601e+15, step = 99701 (0.203 sec)
INFO:tensorflow:global_step/sec: 533.289
INFO:tensorflow:loss = 2.12652e+15, step = 99801 (0.188 sec)
INFO:tensorflow:global_step/sec: 533.273
INFO:tensorflow:loss = 1.31647e+15, step = 99901 (0.203 sec)
INFO:tensorflow:Saving checkpoints for 100000 into C:\projection_model\job\model.ckpt.
INFO:tensorflow:Loss for final step: 2.88956e+15.
Evaluating...
INFO:tensorflow:Evaluation [1/4000]
INFO:tensorflow:Evaluation [2/4000]
INFO:tensorflow:Evaluation [3/4000]
...
INFO:tensorflow:Evaluation [3998/4000]
INFO:tensorflow:Evaluation [3999/4000]
INFO:tensorflow:Evaluation [4000/4000]
INFO:tensorflow:Finished evaluation at 2017-08-30-19:04:03
INFO:tensorflow:Saving dict for global step 100000: average_loss = 1.37941e+13, global_step = 100000, loss = 1.76565e+15
Predicting...
[{'predictions': array([-169532.5], dtype=float32)}] # Should be somewhere around 140944.00

Why is the model failing to learn the data? I have tried different regressors and input normalization, but nothing has helped.

Answer:

tf.contrib.learn.DNNRegressor hides too many details. That is great when everything works, but frustrating when you need to debug.

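For example, the learning rate may be too large. You don't see the learning rate in your code because DNNRegressor chooses it for you. The default is 0.05, which is reasonable for many applications but may be too large in your particular case. I suggest instantiating the optimizer yourself, AdagradOptimizer(learning_rate), and passing it to DNNRegressor.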
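To make the role of the learning rate concrete, here is a minimal pure-Python sketch of the textbook Adagrad update on a single scalar parameter. It is not TensorFlow's implementation (details such as the initial accumulator value may differ), and the toy objective, `adagrad_minimize`, and `grad` are hypothetical names chosen for illustration. It only shows that the first step has roughly the size of `learning_rate` and that a large value can overshoot the minimum.

```python
import math


def adagrad_minimize(grad_fn, w0, learning_rate, steps, eps=1e-8):
    """Run textbook Adagrad on one scalar parameter and return its path."""
    w = w0
    accum = 0.0  # running sum of squared gradients
    path = [w]
    for _ in range(steps):
        g = grad_fn(w)
        accum += g * g
        # The step is the learning rate scaled by the gradient history.
        w -= learning_rate * g / (math.sqrt(accum) + eps)
        path.append(w)
    return path


# Hypothetical toy objective: f(w) = (w - 3)^2, so f'(w) = 2 * (w - 3).
def grad(w):
    return 2.0 * (w - 3.0)


for lr in (0.05, 0.5, 5.0):
    path = adagrad_minimize(grad, w0=0.0, learning_rate=lr, steps=50)
    print(f"learning_rate={lr}: first step={path[1] - path[0]:.3f}, "
          f"final w={path[-1]:.3f}")
```

In the TensorFlow 1.x API used in the question, the corresponding knob would be the `learning_rate` argument of the optimizer object passed as `optimizer=` instead of the string 'Adagrad'.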

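The initial weights may also be too large. DNNRegressor uses tf.contrib.layers.fully_connected layers without overriding weights_initializer and biases_initializer. As before, the defaults are quite reasonable, but if you want different values, you have no control over them at all.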
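The effect of the initial weight scale can be seen without TensorFlow. The sketch below is an assumption-laden illustration, not what DNNRegressor does internally: it builds a dense network with the question's hidden sizes (64 and 32) and ELU activations, draws Gaussian weights with a chosen standard deviation, and measures the average output magnitude before any training. The input width of 8 and all function names are hypothetical.

```python
import math
import random


def elu(x):
    """Exponential linear unit, the activation used in the question."""
    return x if x > 0 else math.exp(x) - 1.0


def init_layers(sizes, stddev, rng):
    """Gaussian weights with the given stddev, zero biases."""
    layers = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        weights = [[rng.gauss(0.0, stddev) for _ in range(n_in)]
                   for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers


def forward(x, layers):
    """Pass one input vector through the layers; ELU on all but the last."""
    for i, (weights, biases) in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) + b
             for row, b in zip(weights, biases)]
        if i < len(layers) - 1:
            x = [elu(v) for v in x]
    return x


def mean_abs_output(stddev, sizes=(8, 64, 32, 1), trials=100, seed=0):
    """Average |output| of an untrained network on random unit-scale inputs."""
    rng = random.Random(seed)
    layers = init_layers(sizes, stddev, rng)
    total = 0.0
    for _ in range(trials):
        x = [rng.gauss(0.0, 1.0) for _ in range(sizes[0])]
        total += abs(forward(x, layers)[0])
    return total / trials


for stddev in (0.01, 0.1, 1.0):
    print(f"stddev={stddev}: mean |output| = {mean_abs_output(stddev):.4g}")
```

The output magnitude grows quickly with the weight scale because each layer multiplies it again, which is why an initializer that is fine for one problem can be badly scaled for another.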

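What I usually do to check that a neural network works at least to some degree is to reduce the training set to a few examples and try to make the network overfit them. This experiment is very fast, so I can try various learning rates and other hyperparameters to find a sweet spot before moving on to the larger dataset.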
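The shape of that experiment can be sketched in a few lines. To keep the example self-contained, a one-variable linear model trained with plain gradient descent stands in for the network, so this is not the DNNRegressor itself. The six data points are the SM0 rows from the question's sample (Time against Result); standardizing both columns is an assumption of this sketch, and the function names are hypothetical.

```python
def standardize(values):
    """Rescale a list to zero mean and unit variance."""
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / std for v in values]


def fit_linear(xs, ys, learning_rate, steps=200):
    """Gradient descent on mean squared error for y = w * x + b.

    Returns the final mean squared error on the training points.
    """
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        grad_w = 2.0 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2.0 * sum(errors) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / n


# The six SM0 / i_low rows from the sample data in the question.
times = [0, 1, 2, 3, 4, 5]
results = [28006.0, 28804.0, 30140.0, 31598.0, 33130.0, 34663.0]
xs, ys = standardize(times), standardize(results)

# Sweep a few learning rates and compare the final training loss.
losses = {lr: fit_linear(xs, ys, lr) for lr in (0.001, 0.01, 0.1, 1.5)}
for lr, loss in losses.items():
    print(f"learning_rate={lr}: final MSE={loss:.4g}")
```

On this toy problem the smallest rate has not converged after 200 steps, the largest diverges, and the values in between fit the points closely. The same sweep, run against the real model on a handful of rows, is what narrows down a workable range.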

Further troubleshooting: visualize the distributions of the activations, gradients, or weights of each layer in TensorBoard to narrow down the problem.
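Before reaching for TensorBoard, it can also help to put the logged numbers on a human scale. The arithmetic below uses only the two values from the evaluation line in the question's log. It assumes the estimator reports `loss` as a squared error summed over the batch and `average_loss` as the per-example mean; under that assumption the ratio is the batch size and the square root of `average_loss` is the root mean squared error.

```python
import math

average_loss = 1.37941e13  # per-example value from the evaluation log
batch_loss = 1.76565e15    # "loss" from the same log line

implied_batch_size = batch_loss / average_loss
rmse = math.sqrt(average_loss)

print(f"implied batch size: {implied_batch_size:.1f}")
print(f"RMSE: {rmse:.3e}")
```

The ratio comes out at about 128, and the RMSE at roughly 3.7 million. Against Result values that reach about 22 million in the sample rows, a loss near 10^15 mostly reflects the scale of the unnormalized target, although the error itself is still large.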
