Confusion about F1 score and AUC score on highly imbalanced data when using 5-fold cross-validation

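I am trying to classify highly imbalanced data using 5-fold cross-validation. My sample sizes are:

Total samples: 12237899

Positive samples: 1064 (about 0.01% of the total)

I also want to avoid data leakage. However, I get a fairly low average precision score and F1 score. Because SMOTE does not perform well on extremely imbalanced data, I used weighted logistic regression to handle the imbalance. I also see several options for the F1 score in the sklearn library. For example, f1_score has the parameter average{'micro', 'macro', 'samples', 'weighted', 'binary'}. I am not sure which one I should use. Also, how does it differ from the scoring='f1' argument in cross_val_score(clf, X, y, cv=5, scoring='f1')?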

    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns
    from tqdm import tqdm
    from imblearn.metrics import geometric_mean_score
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import StratifiedKFold
    from sklearn.metrics import (accuracy_score, auc, average_precision_score,
                                 balanced_accuracy_score, confusion_matrix,
                                 f1_score, roc_curve)

    # X (a pandas DataFrame) and y (the labels) are loaded elsewhere.
    Balanced_Acc = []
    F1 = []
    G = []
    AP = []
    aucs = []
    tprs = []
    #fi = []
    #rf_pi_train = []
    #rf_pi_test = []
    mean_fpr = np.linspace(0, 1, 100)
    acc = []
    cm = []
    i = 0

    skf = StratifiedKFold(n_splits=5, random_state=None, shuffle=False)
    for trainIndex, testIndex in tqdm(skf.split(X, y)):
        xTrain, xTest = X.iloc[trainIndex], X.iloc[testIndex]
        yTrain, yTest = y[trainIndex], y[testIndex]

        clf = LogisticRegression(class_weight='balanced', max_iter=100000)
        clf.fit(xTrain, yTrain)
        yPred = clf.predict(xTest)  # hard 0/1 labels, reused by every metric below

        Balanced_Acc.append(balanced_accuracy_score(yTest, yPred))
        AP.append(average_precision_score(yTest, yPred))
        F1.append(f1_score(yTest, yPred))
        G.append(geometric_mean_score(yTest, yPred))
        #fi.append(clf.feature_importances_)
        #result_train = permutation_importance(clf, xTrain, yTrain, n_repeats=1)
        #result_test = permutation_importance(clf, xTest, yTest, n_repeats=1)
        #rf_pi_train.append(result_train.importances)
        #rf_pi_test.append(result_test.importances)
        acc.append(accuracy_score(yTest, yPred))
        cm.append(confusion_matrix(yTest, yPred))

        # ROC curve (computed here from the hard class predictions)
        fpr, tpr, thresholds = roc_curve(yTest, yPred)
        tprs.append(np.interp(mean_fpr, fpr, tpr))
        tprs[-1][0] = 0.0
        roc_auc = auc(fpr, tpr)
        aucs.append(roc_auc)
        plt.plot(fpr, tpr, lw=1, alpha=0.3,
                 label='ROC fold %d (AUC = %0.2f)' % (i, roc_auc))
        i = i + 1

    plt.plot([0, 1], [0, 1], linestyle='--', lw=2, color='r', label='Chance', alpha=.8)
    mean_tpr = np.mean(tprs, axis=0)
    mean_tpr[-1] = 1.0
    mean_auc = auc(mean_fpr, mean_tpr)
    std_auc = np.std(aucs)
    plt.plot(mean_fpr, mean_tpr, color='b',
             label=r'Mean ROC (AUC = %0.2f $\pm$ %0.3f)' % (mean_auc, std_auc),
             lw=2, alpha=.8)

    std_tpr = np.std(tprs, axis=0)
    tprs_upper = np.minimum(mean_tpr + std_tpr, 1)
    tprs_lower = np.maximum(mean_tpr - std_tpr, 0)
    plt.fill_between(mean_fpr, tprs_lower, tprs_upper, color='grey', alpha=.2,
                     label=r'$\pm$ 1 std. dev.')
    plt.xlim([-0.05, 1.05])
    plt.ylim([-0.05, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic')
    plt.legend(loc="lower right")
    plt.show()

    # Sum the per-fold confusion matrices.
    # scikit-learn's layout is [[TN, FP], [FN, TP]] (rows = true, columns = predicted).
    finalCM = np.sum(cm, axis=0)
    print(finalCM)

    ax = sns.heatmap(finalCM, annot=True, cbar=False, fmt='g')
    bottom, top = ax.get_ylim()
    ax.set_ylim(bottom + 0.5, top - 0.5)
    plt.ylabel('True Label')
    plt.xlabel('Predicted Label')
    plt.title('Confusion Matrix')

    print("Balanced Accuracy: ", np.mean(Balanced_Acc))
    print("AP score: ", np.mean(AP))
    print("G-mean: ", np.mean(G))
    print("F1: ", np.mean(F1))
    print('AUC: ', np.mean(aucs))
    #AUC_rf = aucs

I am not sure why the balanced accuracy and the AUC score are the same! I would appreciate your thoughts. Thank you!

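Answer:

You are actually asking three separate questions:

Why are ROC AUC and balanced accuracy so high?
Why are average precision and the F1 score so low?
Which F1 score is appropriate for imbalanced classification?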

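Reminder

Sensitivity: sensitivity = TP / (TP + FN)

False positive rate: FPR = FP / (FP + TN)

Specificity: specificity = 1 - FPR

When the positive class is heavily outnumbered, the TN term in the FPR is the culprit: it dominates the denominator, so the FPR stays close to 0 even when the model produces many false positives.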

Let's look at a simulated example:

    from sklearn.metrics import classification_report
    import numpy as np

    # 10 positives and 99,990 negatives; the model finds only one positive
    y_true = np.concatenate([np.ones(10), np.zeros(99990)])
    y_pred = np.concatenate([np.zeros(9), np.ones(1), np.zeros(99990)])
    print(classification_report(y_true, y_pred))

The output is:

                  precision    recall  f1-score   support

             0.0       1.00      1.00      1.00     99990
             1.0       1.00      0.10      0.18        10

        accuracy                           1.00    100000
       macro avg       1.00      0.55      0.59    100000
    weighted avg       1.00      1.00      1.00    100000

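In the binary case, sensitivity is the recall of the positive class, so it is 0.1.

Likewise, specificity is the recall of the negative class, so it is 1.0.

The FPR is 1 - specificity = 1 - 1.0 = 0.0.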
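The three quantities can also be read directly off the confusion matrix. The following is a minimal sketch on the same synthetic data; it relies on scikit-learn's binary layout [[TN, FP], [FN, TP]], and the variable names are only illustrative.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Same synthetic data as above: 10 positives, 99,990 negatives,
# and a model that finds only one of the positives.
y_true = np.concatenate([np.ones(10, dtype=int), np.zeros(99990, dtype=int)])
y_pred = np.concatenate([np.zeros(9, dtype=int), np.ones(1, dtype=int),
                         np.zeros(99990, dtype=int)])

# For binary labels the matrix is laid out as [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

sensitivity = tp / (tp + fn)   # recall of the positive class
fpr = fp / (fp + tn)           # false positive rate
specificity = 1 - fpr          # recall of the negative class

print(sensitivity, specificity, fpr)
```

With no false positives at all, the FPR is exactly 0 here, even though nine of the ten positives are missed.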

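What is the problem?

ROC AUC

ROC AUC is the area under the curve of sensitivity (TPR) plotted against FPR across all possible thresholds. Because the very large negative class keeps the FPR artificially low (equivalently, inflates specificity), the model reaches a high ROC AUC without much effort.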
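To make this concrete, here is a small sketch with made-up scores (the values 0.9, 0.8 and 0.1 are purely hypothetical). One hundred negatives outrank every positive, yet the ROC AUC barely moves because 100 is a tiny fraction of 99,990 negatives, while precision at a 0.5 threshold collapses.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, roc_auc_score

# Hypothetical scores: the 10 positives score 0.8, 100 negatives outrank
# them at 0.9, and the remaining 99,890 negatives score 0.1.
y_true = np.concatenate([np.ones(10, dtype=int), np.zeros(99990, dtype=int)])
y_score = np.concatenate([np.full(10, 0.8), np.full(100, 0.9), np.full(99890, 0.1)])

roc_auc = roc_auc_score(y_true, y_score)

# Threshold the scores at 0.5 to get hard predictions.
y_pred = (y_score >= 0.5).astype(int)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)

print(f"ROC AUC:   {roc_auc:.4f}")    # about 0.9990
print(f"precision: {precision:.4f}")  # about 0.0909 (10 TP out of 110 predicted positives)
print(f"recall:    {recall:.4f}")     # 1.0000
```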

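Balanced accuracy

With that in mind, it is clear why balanced accuracy is also high. Look at the formula: balanced accuracy = mean(specificity, sensitivity). Because specificity is inflated, the simple mean is also biased toward the majority class.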
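This formula may also explain why the two numbers coincide in the question. The question's code passes the output of clf.predict (hard 0/1 labels) to roc_curve, so each ROC curve has a single interior point (FPR, TPR), and the trapezoidal area under it works out to (sensitivity + specificity) / 2, which is balanced accuracy. A minimal check on the synthetic example:

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, recall_score, roc_auc_score

y_true = np.concatenate([np.ones(10, dtype=int), np.zeros(99990, dtype=int)])
y_pred = np.concatenate([np.zeros(9, dtype=int), np.ones(1, dtype=int),
                         np.zeros(99990, dtype=int)])

sensitivity = recall_score(y_true, y_pred, pos_label=1)
specificity = recall_score(y_true, y_pred, pos_label=0)

bal_acc = balanced_accuracy_score(y_true, y_pred)

# ROC AUC computed from hard labels instead of scores or probabilities
auc_from_labels = roc_auc_score(y_true, y_pred)

print((sensitivity + specificity) / 2, bal_acc, auc_from_labels)  # all about 0.55
```

Feeding roc_curve or roc_auc_score continuous scores, for example clf.predict_proba(xTest)[:, 1] or clf.decision_function(xTest), gives a threshold-independent AUC that no longer has to equal balanced accuracy.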

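How to fix it?

Passing adjusted=True to sklearn.metrics.balanced_accuracy_score corrects balanced accuracy for chance, so that random guessing scores 0 and a perfect model scores 1. For ROC AUC, the alternative is the area under the precision-recall curve, which sklearn.metrics.average_precision_score summarizes.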
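Both suggestions can be tried on toy data. This sketch uses the synthetic predictions from earlier for the adjusted balanced accuracy and a separate, purely hypothetical set of scores for average precision (the score values are assumptions chosen for illustration). Note that average_precision_score is meant to receive continuous scores rather than hard labels.

```python
import numpy as np
from sklearn.metrics import average_precision_score, balanced_accuracy_score

y_true = np.concatenate([np.ones(10, dtype=int), np.zeros(99990, dtype=int)])

# 1) Chance-adjusted balanced accuracy on the hard predictions used earlier.
y_pred = np.concatenate([np.zeros(9, dtype=int), np.ones(1, dtype=int),
                         np.zeros(99990, dtype=int)])
plain = balanced_accuracy_score(y_true, y_pred)                    # about 0.55
adjusted = balanced_accuracy_score(y_true, y_pred, adjusted=True)  # about 0.10

# 2) Average precision on hypothetical scores: 100 negatives outrank
#    the positives, the remaining negatives score low.
y_score = np.concatenate([np.full(10, 0.8), np.full(100, 0.9), np.full(99890, 0.1)])
ap = average_precision_score(y_true, y_score)                      # about 0.09

print(plain, adjusted, ap)
```

In the binary case the adjustment amounts to 2 * balanced_accuracy - 1, so a score of 0.55 becomes 0.10.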

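What about the F1 score options?

For binary classification, the default is to compute the F1 score of the positive class only. As the documentation states, the default is average='binary'.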
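The same default applies to the scoring='f1' string in cross_val_score, which is the binary F1 of the positive class; the other averages have their own scorer names, such as 'f1_macro'. A small sketch on a synthetic dataset (the data from make_classification and the model settings are assumptions for illustration, not the asker's setup):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import cross_val_score

# Synthetic, moderately imbalanced stand-in for the real data
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.9, 0.1], random_state=0)
clf = LogisticRegression(class_weight="balanced", max_iter=1000)

# scoring="f1" is the binary F1 of the positive class ...
scores_string = cross_val_score(clf, X, y, cv=5, scoring="f1")
# ... which matches an explicit scorer built with average="binary".
scores_binary = cross_val_score(
    clf, X, y, cv=5, scoring=make_scorer(f1_score, average="binary"))

# Other averages are selected by name, e.g. "f1_macro" or "f1_weighted".
scores_macro = cross_val_score(clf, X, y, cv=5, scoring="f1_macro")

print(scores_string)
print(scores_binary)
print(scores_macro)
```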

Let's compare all the average options on our synthetic example:

    from sklearn.metrics import f1_score

    f1_score(y_true, y_pred, average='binary')    # 0.1818...
    f1_score(y_true, y_pred, average='micro')     # 0.9999...
    f1_score(y_true, y_pred, average='macro')     # 0.5908...
    f1_score(y_true, y_pred, average='weighted')  # 0.9998...

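(average=None returns an array with the F1 score of each class, and 'samples' applies only to multilabel problems, so it is not relevant here.)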

Reminder of the related definitions:

Precision: precision = TP / (TP + FP)

Recall: recall = TP / (TP + FN)

F1 score: f1_score = 2 * precision * recall / (precision + recall)
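Applied to the positive class of the synthetic example (TP = 1, FP = 0, FN = 9), these formulas reproduce the default binary score. A quick sketch of the arithmetic:

```python
import numpy as np
from sklearn.metrics import f1_score

y_true = np.concatenate([np.ones(10, dtype=int), np.zeros(99990, dtype=int)])
y_pred = np.concatenate([np.zeros(9, dtype=int), np.ones(1, dtype=int),
                         np.zeros(99990, dtype=int)])

# Counts for the positive class, read off the example above
tp, fp, fn = 1, 0, 9

precision = tp / (tp + fp)   # 1.0
recall = tp / (tp + fn)      # 0.1
f1_manual = 2 * precision * recall / (precision + recall)

f1_sklearn = f1_score(y_true, y_pred)  # default average='binary'

print(f1_manual, f1_sklearn)  # both about 0.1818
```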

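Because it does not take TN into account, the default F1 score ignores the model's ability to correctly detect the majority class. In some cases that may be too harsh, so the other options try to account for it with different strategies:

average="micro" counts TP, FP and FN for both the positive and the negative class, sums them, and then computes precision, recall and the F1 score from the totals.
average="macro" counts TP, FP and FN for each class separately, computes the F1 score of each class, and then takes the unweighted mean of the per-class F1 scores.
average="weighted" does the same as average="macro" but takes a mean weighted by support (the number of samples in each class).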
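The three strategies above can be reproduced by hand from the per-class scores. The sketch below is one way to do it on the synthetic example and compares each result with what f1_score returns; the helper variable names are illustrative.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_recall_fscore_support

y_true = np.concatenate([np.ones(10, dtype=int), np.zeros(99990, dtype=int)])
y_pred = np.concatenate([np.zeros(9, dtype=int), np.ones(1, dtype=int),
                         np.zeros(99990, dtype=int)])

# Per-class F1 and support (index 0 = negative class, index 1 = positive class)
_, _, f1_per_class, support = precision_recall_fscore_support(y_true, y_pred)

# macro: unweighted mean of the per-class F1 scores
macro_manual = f1_per_class.mean()

# weighted: mean of the per-class F1 scores weighted by class support
weighted_manual = np.average(f1_per_class, weights=support)

# micro: pool the counts over both classes. Every correct prediction is a TP
# for its class; every error is one FP (predicted class) and one FN (true class).
tp_total = np.sum(y_true == y_pred)
fp_total = fn_total = np.sum(y_true != y_pred)
micro_manual = 2 * tp_total / (2 * tp_total + fp_total + fn_total)

print(micro_manual, f1_score(y_true, y_pred, average="micro"))
print(macro_manual, f1_score(y_true, y_pred, average="macro"))
print(weighted_manual, f1_score(y_true, y_pred, average="weighted"))
```

For a single-label problem like this one, the pooled micro F1 equals plain accuracy, which is why it looks so flattering on imbalanced data.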

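Which F1 score to choose depends heavily on the application. In my experience, average="binary" is too harsh on model performance, but I have never faced a class imbalance as severe as yours.

In your case, AP and F1 are so low because the model fails to predict the positive class. There are many strategies; I suggest one that has worked for me: select a representative but smaller subset of the majority class.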
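As a rough illustration of that idea, the sketch below uses plain random undersampling as a stand-in for a "representative" subset (a simplification, not the selection method the answer has in mind). The helper undersample_majority, the 10:1 ratio and the synthetic data are all assumptions. The subsampling is applied to the training fold only, so each test fold keeps its original class balance and nothing leaks across folds.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from sklearn.model_selection import StratifiedKFold


def undersample_majority(X, y, ratio, rng):
    """Keep every positive and a random sample of `ratio` negatives per positive."""
    pos_idx = np.flatnonzero(y == 1)
    neg_idx = np.flatnonzero(y == 0)
    n_keep = min(len(neg_idx), ratio * len(pos_idx))
    kept_neg = rng.choice(neg_idx, size=n_keep, replace=False)
    keep = np.concatenate([pos_idx, kept_neg])
    return X[keep], y[keep]


# Synthetic imbalanced data standing in for the real dataset
X, y = make_classification(n_samples=20000, n_features=10,
                           weights=[0.99, 0.01], random_state=0)

rng = np.random.default_rng(0)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

ap_scores = []
for train_idx, test_idx in skf.split(X, y):
    # Subsample the training fold only; the test fold is left untouched.
    X_train, y_train = undersample_majority(X[train_idx], y[train_idx],
                                            ratio=10, rng=rng)
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

    # Score with probabilities, not hard labels.
    y_score = clf.predict_proba(X[test_idx])[:, 1]
    ap_scores.append(average_precision_score(y[test_idx], y_score))

print("Mean AP over folds:", np.mean(ap_scores))
```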

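There is a large body of work on instance selection, with methods such as selective nearest neighbor and iterative case filtering. I found this article very informative.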
