Negative cross-validation score for a decision tree regression model

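I ran into a problem while evaluating a decision tree regression model with cross-validation: the scores come out negative, and I can't figure out why.

Here is my code: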

    import numpy as np
    from scipy.stats import sem
    from sklearn.model_selection import KFold, cross_val_score
    from sklearn.tree import DecisionTreeRegressor

    # df is my DataFrame (not shown here)
    all_depths = []
    all_mean_scores = []
    for max_depth in range(1, 11):
        all_depths.append(max_depth)
        simple_tree = DecisionTreeRegressor(max_depth=max_depth)
        cv = KFold(n_splits=2, shuffle=True, random_state=13)
        scores = cross_val_score(simple_tree, df.loc[:, 'system':'gwno'], df['gdp_growth'], cv=cv)
        mean_score = np.mean(scores)
        all_mean_scores.append(mean_score)
        print("max_depth = ", max_depth, scores, mean_score, sem(scores))

The results are:

    max_depth =  1 [-0.45596988 -0.10215719] -0.2790635315340 0.176906344162
    max_depth =  2 [-0.5532268 -0.0186984] -0.285962600541 0.267264196259
    max_depth =  3 [-0.50359311  0.31992411] -0.0918345038141 0.411758610421
    max_depth =  4 [-0.57305355  0.21154193] -0.180755811466 0.392297741456
    max_depth =  5 [-0.58994928  0.21180425] -0.189072515181 0.400876761509
    max_depth =  6 [-0.71730634  0.22139877] -0.247953784441 0.469352551213
    max_depth =  7 [-0.60118621  0.22139877] -0.189893720551 0.411292487323
    max_depth =  8 [-0.69635044  0.13976584] -0.278292298411 0.418058142228
    max_depth =  9 [-0.78917478  0.30970763] -0.239733577455 0.549441204178
    max_depth =  10 [-0.76098227  0.34512503] -0.207928623044 0.553053649792

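My questions are:

1) Is the returned score the mean squared error (MSE)? If so, why is it negative?

2) I only have about 40 observations and about 70 variables. Could that be the problem?

Thanks in advance for your help.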

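Answer:

TL;DR:

1) No, unless you specify it explicitly or it happens to be the estimator's default .score method. Since you did not specify it, cross_val_score fell back to DecisionTreeRegressor.score, which returns the coefficient of determination, R^2. R^2 can be negative.

2) Yes, that is the problem, and it explains why you are getting negative coefficients of determination.

Details: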

You called the function like this:

    scores = cross_val_score(simple_tree, df.loc[:,'system':'gwno'], df['gdp_growth'], cv=cv)

So you did not pass the "scoring" parameter explicitly. Let's look at the documentation:

scoring : string, callable or None, optional, default: None

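A string (see the model evaluation documentation) or a scorer callable object / function with signature scorer(estimator, X, y).

So the documentation does not say so explicitly, but this probably means that the default .score method of your estimator is used.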
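To make the role of the scoring parameter concrete, here is a minimal sketch on synthetic stand-in data (the arrays X and y are assumptions, not the asker's DataFrame). It assumes a recent scikit-learn in which the scorer names "r2" and "neg_mean_squared_error" are available; older releases used different names. Leaving scoring unset should give the same numbers as scoring="r2" for a regressor, while MSE has to be requested explicitly and comes back negated.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption): 40 rows, 5 features, one informative column.
rng = np.random.RandomState(0)
X = rng.normal(size=(40, 5))
y = X[:, 0] + 0.1 * rng.normal(size=40)

# Fixed random_state so that every fit of the tree is reproducible.
tree = DecisionTreeRegressor(max_depth=3, random_state=0)
cv = KFold(n_splits=2, shuffle=True, random_state=13)

# scoring=None: falls back to the estimator's own .score method (R^2 here).
default_scores = cross_val_score(tree, X, y, cv=cv)

# Asking for R^2 explicitly.
r2_scores = cross_val_score(tree, X, y, cv=cv, scoring="r2")

# Asking for MSE: scikit-learn returns the negated value so that higher is better.
neg_mse_scores = cross_val_score(tree, X, y, cv=cv, scoring="neg_mean_squared_error")
mse_scores = -neg_mse_scores

print("default:", default_scores)
print("r2:     ", r2_scores)
print("mse:    ", mse_scores)
```

Note that even when MSE is requested, the raw scores are negative by convention, so a negative sign alone does not tell you which metric you are looking at.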

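To verify this hypothesis, let's dig into the source code. We see that the scorer that ends up being used is the following:

    scorer = check_scoring(estimator, scoring=scoring)

So let's look at the source code of check_scoring: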

    has_scoring = scoring is not None
    if not hasattr(estimator, 'fit'):
        raise TypeError("estimator should be an estimator implementing "
                        "'fit' method, %r was passed" % estimator)
    if isinstance(scoring, six.string_types):
        return get_scorer(scoring)
    elif has_scoring:
        # Heuristic to ensure user has not passed a metric
        module = getattr(scoring, '__module__', None)
        if hasattr(module, 'startswith') and \
           module.startswith('sklearn.metrics.') and \
           not module.startswith('sklearn.metrics.scorer') and \
           not module.startswith('sklearn.metrics.tests.'):
            raise ValueError('scoring value %r looks like it is a metric '
                             'function rather than a scorer. A scorer should '
                             'require an estimator as its first parameter. '
                             'Please use `make_scorer` to convert a metric '
                             'to a scorer.' % scoring)
        return get_scorer(scoring)
    elif hasattr(estimator, 'score'):
        return _passthrough_scorer
    elif allow_none:
        return None
    else:
        raise TypeError(
            "If no scoring is specified, the estimator passed should "
            "have a 'score' method. The estimator %r does not." % estimator)

Note that scoring=None was passed, so:

    has_scoring = scoring is not None

means that has_scoring == False. In addition, the estimator has a .score attribute, so we go through this branch:

    elif hasattr(estimator, 'score'):
        return _passthrough_scorer

which is simply:

    def _passthrough_scorer(estimator, *args, **kwargs):
        """Function that wraps estimator.score"""
        return estimator.score(*args, **kwargs)
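As a rough check of what this pass-through amounts to, the sketch below reproduces the default behaviour by hand: fit a fresh copy of the estimator on each training fold and call its .score method on the held-out fold. The data is synthetic (an assumption), and the loop is only an illustration of the idea, not the library's actual implementation.

```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption), not the asker's DataFrame.
rng = np.random.RandomState(1)
X = rng.normal(size=(40, 5))
y = X[:, 0] + 0.1 * rng.normal(size=40)

tree = DecisionTreeRegressor(max_depth=3, random_state=0)
cv = KFold(n_splits=2, shuffle=True, random_state=13)

# Manual loop: one fresh, unfitted copy of the estimator per fold.
manual_scores = []
for train_idx, test_idx in cv.split(X):
    model = clone(tree).fit(X[train_idx], y[train_idx])
    # estimator.score is what the pass-through scorer ends up calling.
    manual_scores.append(model.score(X[test_idx], y[test_idx]))

# Library call with no scoring argument.
library_scores = cross_val_score(tree, X, y, cv=cv)

print("manual: ", manual_scores)
print("library:", library_scores)
```

With the integer random_state on KFold, both paths see the same splits, so the two lists should agree.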

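So in the end we know that the scorer is your estimator's default score. Let's check the estimator's documentation, which states explicitly:

Returns the coefficient of determination R^2 of the prediction.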

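The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) ** 2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get an R^2 score of 0.0.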
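The definition is easy to check by hand. The sketch below computes 1 - u/v directly with NumPy on a tiny made-up target vector (the numbers and the helper name r2_by_hand are purely illustrative) and compares the result with sklearn.metrics.r2_score. Predicting the mean everywhere gives 0, and predictions that are worse than the mean push the value below 0.

```python
import numpy as np
from sklearn.metrics import r2_score


def r2_by_hand(y_true, y_pred):
    """Compute 1 - u/v as in the quoted definition (illustrative helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    u = ((y_true - y_pred) ** 2).sum()          # squared errors of the model
    v = ((y_true - y_true.mean()) ** 2).sum()   # squared errors of the mean predictor
    return 1.0 - u / v


y_true = np.array([1.0, 2.0, 3.0, 4.0])

# Constant model that always predicts the mean of y_true: u == v, so R^2 == 0.
mean_pred = np.full_like(y_true, y_true.mean())
r2_mean = r2_by_hand(y_true, mean_pred)

# Predictions in reversed order: u = 20, v = 5, so R^2 = 1 - 20/5 = -3.
bad_pred = y_true[::-1]
r2_bad = r2_by_hand(y_true, bad_pred)

print(r2_mean, r2_bad, r2_score(y_true, bad_pred))
```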

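So it looks like your scores are in fact coefficients of determination. A negative R^2 means that your model performs very poorly: worse than if we simply predicted the expected value (i.e. the mean) for every input. That makes sense because, as you said:

"I only have about 40 observations and about 70 variables. Could that be the problem?"

That is indeed the problem. With only 40 observations, trying to make meaningful predictions over a 70-dimensional problem space is close to hopeless.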
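To get a feel for this regime, here is a small sketch using purely synthetic noise with the same shape as described in the question (40 rows, 70 columns). This is an assumption made for illustration and says nothing about the actual GDP data. The target is generated independently of the features, so there is nothing for the tree to learn, and its held-out R^2 will typically land at or below zero. A DummyRegressor that predicts the training mean is included as a baseline.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Pure-noise data (assumption): 40 observations, 70 features, unrelated target.
rng = np.random.RandomState(42)
X = rng.normal(size=(40, 70))
y = rng.normal(size=40)

cv = KFold(n_splits=2, shuffle=True, random_state=13)

# Default scoring, i.e. R^2 for both regressors.
tree_scores = cross_val_score(
    DecisionTreeRegressor(max_depth=3, random_state=0), X, y, cv=cv
)
baseline_scores = cross_val_score(DummyRegressor(strategy="mean"), X, y, cv=cv)

print("tree R^2 per fold:    ", tree_scores)
print("baseline R^2 per fold:", baseline_scores)
```

The baseline itself scores at or slightly below zero, because the mean of the training fold is not exactly the mean of the test fold. A model that chases noise in 70 dimensions usually does worse than that.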
