Evaluating the mean and standard deviation of cross-validation scores in scikit-learn's GridSearchCV

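I'm using Python 2.7 and scikit-learn for machine learning. I'm running a grid search to find the best hyperparameters for my dataset and a random forest classifier. I use leave-one-out cross-validation, with area under the ROC curve (AUROC) as the metric for evaluating each set of hyperparameters. My code runs, but I'm somewhat confused by the output of clf.grid_scores_. As I understand it, each set of hyperparameters should be evaluated across all data folds, to see how well a model trained on all the other folds predicts the held-out fold. That gives one AUROC per fold. The grid search should then report, for each set of hyperparameters, the mean and standard deviation over all folds. With .grid_scores_ we can inspect the mean, the standard deviation, and the raw AUROC values for each set of hyperparameters.

My question is: why do the reported mean and standard deviation of the cross-validation scores not match what I get by calling .mean() and .std() on the per-fold AUROC values myself?

The code is as follows: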

from sklearn import cross_validation, grid_search
from sklearn.ensemble import RandomForestClassifier

lol = cross_validation.LeaveOneLabelOut(group_labels)
rf = RandomForestClassifier(random_state=42, n_jobs=96)
parameters = {'min_samples_leaf':[500,1000],
              'n_estimators': [100],
              'criterion': ['entropy',],
              'max_features': ['sqrt']
             }
clf = grid_search.GridSearchCV(rf, parameters, scoring='roc_auc', cv=lol)
clf.fit(train_features, train_labels)

for params, mean_score, scores in clf.grid_scores_:
    print("%0.3f (+/-%0.3f) for %r" % (scores.mean(), scores.std(), params))
print

for g in clf.grid_scores_: print g
print

print clf.best_score_
print clf.best_estimator_

The output is:

0.603 (+/-0.108) for {'max_features': 'sqrt', 'n_estimators': 100, 'criterion': 'entropy', 'min_samples_leaf': 500}
0.601 (+/-0.108) for {'max_features': 'sqrt', 'n_estimators': 100, 'criterion': 'entropy', 'min_samples_leaf': 1000}

mean: 0.60004, std: 0.10774, params: {'max_features': 'sqrt', 'n_estimators': 100, 'criterion': 'entropy', 'min_samples_leaf': 500}
mean: 0.59705, std: 0.10821, params: {'max_features': 'sqrt', 'n_estimators': 100, 'criterion': 'entropy', 'min_samples_leaf': 1000}

0.600042993354
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',
            max_depth=None, max_features='sqrt', max_leaf_nodes=None,
            min_samples_leaf=500, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=96,
            oob_score=False, random_state=42, verbose=0, warm_start=False)

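Why do I compute a mean of 0.603 for the first classifier while the grid search reports 0.60004? (The second mean shows a similar discrepancy.) I feel that either I'm missing something important that would help me find the best set of hyperparameters, or there is a bug in sklearn.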

Answer:

I was confused at first too, so I looked at the source code. The following two lines show how the cross-validation score is aggregated:

this_score *= this_n_test_samples
n_test_samples += this_n_test_samples

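When the grid search computes the mean, it is a weighted mean. Your LeaveOneLabelOut CV is most likely unbalanced, i.e. the labels do not all have the same number of samples. To reproduce the mean validation score, multiply each fold's score by the fraction of the total samples that the fold contains, then sum the results.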
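To make the difference concrete, here is a minimal sketch comparing the plain mean of per-fold scores with the size-weighted mean described above. The helper name weighted_cv_mean and the fold scores and fold sizes are hypothetical, made up purely for illustration; they are not taken from the question's data or from scikit-learn itself.

```python
import numpy as np


def weighted_cv_mean(fold_scores, fold_sizes):
    """Mean of per-fold scores, weighted by each fold's test-set size.

    Hypothetical helper mirroring the two source lines quoted above:
    each score is multiplied by its fold's sample count, and the sum is
    divided by the total number of test samples.
    """
    scores = np.asarray(fold_scores, dtype=float)
    sizes = np.asarray(fold_sizes, dtype=float)
    return float(np.sum(scores * sizes) / np.sum(sizes))


# Made-up per-fold AUROC values and held-out fold sizes (assumptions).
fold_scores = [0.50, 0.60, 0.70]
fold_sizes = [300, 100, 100]

plain_mean = float(np.mean(fold_scores))            # what scores.mean() gives
weighted_mean = weighted_cv_mean(fold_scores, fold_sizes)

print("unweighted mean: %.3f" % plain_mean)
print("weighted mean:   %.3f" % weighted_mean)
```

With equally sized folds the two values coincide, so the gap only appears when the held-out groups differ in size. In scikit-learn releases of that era this weighting was, as far as I recall, controlled by the iid parameter of GridSearchCV, so it is worth checking the documentation for the version you have installed.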
