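Plotting ROC curves for K-fold cross-validation

I am working with an imbalanced dataset. Before applying a machine learning model, I split the dataset into training and test sets and applied the SMOTE algorithm to balance the dataset. I want to apply cross-validation, plot the ROC curve of each fold together with its AUC value, and show the mean AUC in the plot. I named the resampled training set variables X_train_res and y_train_res; here is the code: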

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.metrics import auc, roc_curve
    from sklearn.model_selection import StratifiedKFold
    from sklearn.svm import SVC

    # X_train_res, y_train_res: the SMOTE-resampled training data (NumPy arrays)
    cv = StratifiedKFold(n_splits=10)
    classifier = SVC(kernel='sigmoid', probability=True, random_state=0)

    tprs = []
    aucs = []
    mean_fpr = np.linspace(0, 1, 100)

    plt.figure(figsize=(10, 10))
    i = 0
    for train, test in cv.split(X_train_res, y_train_res):
        # Fit on the train fold, predict class probabilities on the test fold
        probas_ = classifier.fit(X_train_res[train], y_train_res[train]).predict_proba(X_train_res[test])
        # Compute ROC curve and area under the curve
        fpr, tpr, thresholds = roc_curve(y_train_res[test], probas_[:, 1])
        # Interpolate TPR onto the common FPR grid; np.interp stands in for the
        # bare interp of the original (presumably SciPy's deprecated alias)
        tprs.append(np.interp(mean_fpr, fpr, tpr))
        tprs[-1][0] = 0.0
        roc_auc = auc(fpr, tpr)
        aucs.append(roc_auc)
        plt.plot(fpr, tpr, lw=1, alpha=0.3,
                 label='ROC fold %d (AUC = %0.2f)' % (i, roc_auc))
        i += 1

    plt.plot([0, 1], [0, 1], linestyle='--', lw=2, color='r',
             label='Chance', alpha=.8)

    # Mean ROC curve across folds and the spread of the per-fold AUCs
    mean_tpr = np.mean(tprs, axis=0)
    mean_tpr[-1] = 1.0
    mean_auc = auc(mean_fpr, mean_tpr)
    std_auc = np.std(aucs)
    plt.plot(mean_fpr, mean_tpr, color='b',
             label=r'Mean ROC (AUC = %0.2f $\pm$ %0.2f)' % (mean_auc, std_auc),
             lw=2, alpha=.8)

    # Shade one standard deviation around the mean curve
    std_tpr = np.std(tprs, axis=0)
    tprs_upper = np.minimum(mean_tpr + std_tpr, 1)
    tprs_lower = np.maximum(mean_tpr - std_tpr, 0)
    plt.fill_between(mean_fpr, tprs_lower, tprs_upper, color='grey', alpha=.2,
                     label=r'$\pm$ 1 std. dev.')

    plt.xlim([-0.01, 1.01])
    plt.ylim([-0.01, 1.01])
    plt.xlabel('False Positive Rate', fontsize=18)
    plt.ylabel('True Positive Rate', fontsize=18)
    plt.title('Cross-Validation ROC of SVM', fontsize=18)
    plt.legend(loc="lower right", prop={'size': 15})
    plt.show()

Here is the output:

Please tell me whether this code is correct for plotting cross-validation ROC curves.

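Answer:

"The thing is that I am not clear about cross-validation. Inside the for loop, I have passed the training sets of the X and y variables. Does cross-validation work like this?"

Setting aside SMOTE and the imbalance issue, which are not included in your code, your procedure looks correct.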

In more detail, for each of your n_splits=10:

you create the train and test folds

you fit the model using the train fold:

classifier.fit(X_train_res[train], y_train_res[train])

then you predict the probabilities using the test fold:
```
predict_proba(X_train_res[test])
```

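This is exactly the idea behind cross-validation.

So, since you have n_splits=10, you get 10 ROC curves and the corresponding AUC values (and their average), exactly as expected.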
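As a minimal sketch of what each split does, the snippet below runs the same fit-on-train, predict-on-test loop on synthetic data and records one AUC per fold. The data from `make_classification` and the names `fold_aucs` and `seen_test` are assumptions made for illustration; they are not part of the original question.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

# Synthetic stand-in for X_train_res / y_train_res (assumption for this sketch)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

cv = StratifiedKFold(n_splits=10)
classifier = SVC(kernel="sigmoid", probability=True, random_state=0)

fold_aucs = []   # one AUC per fold
seen_test = []   # every index that was ever used for testing
disjoint = True  # stays True if no split shares rows between train and test

for train, test in cv.split(X, y):
    # train and test are index arrays into X and y
    disjoint = disjoint and np.intersect1d(train, test).size == 0
    seen_test.extend(test.tolist())

    classifier.fit(X[train], y[train])
    probas = classifier.predict_proba(X[test])[:, 1]
    fold_aucs.append(roc_auc_score(y[test], probas))

print("number of folds:", len(fold_aucs))
print("train/test disjoint in every split:", disjoint)
print("each sample tested exactly once:", sorted(seen_test) == list(range(len(y))))
print("mean AUC = %0.2f, std = %0.2f" % (np.mean(fold_aucs), np.std(fold_aucs)))
```

Each sample lands in a test fold exactly once, and the model that scores it was never fitted on it, which is why the per-fold AUCs can be averaged.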

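However:

The need for (SMOTE) upsampling due to the class imbalance changes the correct procedure and makes your overall process incorrect: you should not upsample your initial dataset; instead, you need to incorporate the upsampling into the CV procedure. If you upsample first, points derived from a given sample can land in a train fold while that sample sits in the test fold, so the test folds are no longer independent of the training data and the AUC estimates become optimistic.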
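The leak can be made visible with a small experiment. The sketch below upsamples before splitting and then counts, per fold, how many test rows originate from a sample that also appears in the train fold. It uses plain duplication of minority rows as a simpler stand-in for SMOTE (an assumption made so that the overlap can be counted exactly; SMOTE creates interpolated neighbours rather than copies, but the dependence between folds arises in the same way). The synthetic data and all variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold

# Imbalanced synthetic data (assumption for this sketch): roughly 90% / 10%
X, y = make_classification(n_samples=200, n_features=5,
                           weights=[0.9, 0.1], random_state=0)

# Upsample the minority class BEFORE splitting (the problematic order).
rng = np.random.default_rng(0)
majority = np.flatnonzero(y == 0)
minority = np.flatnonzero(y == 1)
extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)

# origin[i] is the row of the original data that resampled row i came from
origin = np.concatenate([majority, minority, extra])
X_res, y_res = X[origin], y[origin]

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
leaked_per_fold = []
for train, test in cv.split(X_res, y_res):
    # A test row is "leaked" if its source sample also appears in the train fold
    leaked = np.isin(origin[test], origin[train]).sum()
    leaked_per_fold.append(int(leaked))

print("leaked test rows per fold:", leaked_per_fold)
```

Only minority rows can leak here, because majority rows are never duplicated; those are exactly the rows whose scores drive the ROC curve in an imbalanced problem.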

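So, for each of your n_splits, the correct procedure becomes (note that starting with a stratified CV split, as you have done, is essential in class-imbalanced cases):

create the train and test folds
upsample the train fold using SMOTE
fit the model using the upsampled train fold
predict the probabilities using the test fold (not upsampled)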
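The procedure above can be sketched as follows. This is a minimal illustration under stated assumptions, not the asker's actual pipeline: the data is synthetic, `random_oversample` and `cv_auc_with_resampling` are hypothetical helper names, and simple duplication of minority rows stands in for SMOTE so that the snippet depends only on NumPy and scikit-learn.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC


def random_oversample(X, y, seed=0):
    """Stand-in resampler: duplicate minority rows until classes are balanced."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    keep = [np.arange(len(y))]
    for cls, count in zip(classes, counts):
        if count < target:
            rows = np.flatnonzero(y == cls)
            keep.append(rng.choice(rows, size=target - count, replace=True))
    idx = np.concatenate(keep)
    return X[idx], y[idx]


def cv_auc_with_resampling(X, y, classifier, resample_fn, n_splits=10):
    """Stratified CV in which only the train fold of each split is resampled."""
    cv = StratifiedKFold(n_splits=n_splits)
    aucs = []
    for train, test in cv.split(X, y):
        # Upsample the train fold only
        X_tr, y_tr = resample_fn(X[train], y[train])
        # Fit a fresh copy of the model on the upsampled train fold
        model = clone(classifier).fit(X_tr, y_tr)
        # Predict on the untouched, still imbalanced test fold
        probas = model.predict_proba(X[test])[:, 1]
        aucs.append(roc_auc_score(y[test], probas))
    return np.array(aucs)


# Original training data, NOT resampled beforehand (synthetic in this sketch)
X_train, y_train = make_classification(n_samples=400, n_features=8,
                                       weights=[0.85, 0.15], random_state=0)
classifier = SVC(kernel="sigmoid", probability=True, random_state=0)

aucs = cv_auc_with_resampling(X_train, y_train, classifier, random_oversample)
print("AUC = %0.2f +/- %0.2f" % (aucs.mean(), aucs.std()))
```

If imbalanced-learn is installed, SMOTE can be plugged in by passing `lambda X, y: SMOTE(random_state=0).fit_resample(X, y)` as `resample_fn` (with `from imblearn.over_sampling import SMOTE`). The per-fold `fpr` and `tpr` arrays for plotting can be obtained from the same `probas` with `roc_curve`, as in the question's code.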

For details on the rationale, see my answer in the Data Science SE thread "Why you should not upsample before cross validation".
