Multiple performance metrics with stratified cross-validation

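I have a small, imbalanced dataset that I want to test with different algorithms. For evaluation I need several performance metrics (accuracy, precision, recall, F-score, support).

This is what I plan to do, but I am not entirely happy with it, because there may be a simpler solution: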

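import numpy as np
from sklearn.metrics import accuracy_score, classification_report
# Assumption: `score` is an alias for precision_recall_fscore_support
from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.model_selection import StratifiedKFold

# Assumes gradientBoost is an already-constructed classifier and
# X, Y are NumPy arrays (pandas objects would need .iloc indexing).
skf = StratifiedKFold(n_splits=3, random_state=42, shuffle=True)
accuracy = []
for train_index, test_index in skf.split(X, Y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = Y[train_index], Y[test_index]
    gradientBoost.fit(X_train, y_train)
    y_pred = gradientBoost.predict(X_test)
    accuracy.append(round(accuracy_score(y_test, y_pred), 2))
    # Each of these is an array with one entry per class
    precision, recall, fscore, support = np.round(score(y_test, y_pred), 2)
    print('precision: ' + str(precision))
    print('recall: ' + str(recall))
    print('fscore: ' + str(fscore))
    print('support: ' + str(support))
    print(classification_report(y_test, y_pred))

meanAcc = np.mean(np.asarray(accuracy))
print('meanAcc: ', meanAcc)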

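In theory, I could average every metric the same way I do for accuracy. Is there a simpler and/or more efficient way?

Edit:

I tried plotting accuracy and weighted recall as scorers. Unfortunately, only accuracy shows up in the plot, although the legend lists both accuracy and recall.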

import numpy as np
from matplotlib import pyplot as plt
from sklearn.metrics import accuracy_score, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Initialize classifier (not used by the grid search below)
clf_gini = DecisionTreeClassifier(criterion="gini", random_state=42,
                                  max_depth=10, min_samples_leaf=8)

scoring = {'Accuracy': make_scorer(accuracy_score), 'Recall': 'recall_weighted'}

# Assumes X_Distances and Y are already defined.
# return_train_score=True is needed for the mean_train_* keys used below.
gs = GridSearchCV(DecisionTreeClassifier(criterion='entropy', random_state=42,
                                         min_samples_leaf=10),
                  param_grid={'max_depth': range(2, 30, 2)},
                  scoring=scoring, cv=3, refit='Accuracy',
                  return_train_score=True)
gs.fit(X_Distances, Y)
results = gs.cv_results_

plt.figure(figsize=(13, 13))
plt.title("GridSearchCV evaluating using multiple scorers simultaneously",
          fontsize=16)
plt.xlabel("max_depth")
plt.ylabel("Score")

# Reuse the current axes so the title and labels set above are kept
ax = plt.gca()
ax.set_xlim(0, 32)
ax.set_ylim(0, 1)

# Get the regular numpy array from the MaskedArray
X_axis = np.array(results['param_max_depth'].data, dtype=float)

for scorer, color in zip(sorted(scoring), ['g', 'k']):
    for sample, style in (('train', '--'), ('test', '-')):
        sample_score_mean = results['mean_%s_%s' % (sample, scorer)]
        sample_score_std = results['std_%s_%s' % (sample, scorer)]
        ax.fill_between(X_axis, sample_score_mean - sample_score_std,
                        sample_score_mean + sample_score_std,
                        alpha=0.1 if sample == 'test' else 0, color=color)
        ax.plot(X_axis, sample_score_mean, style, color=color,
                alpha=1 if sample == 'test' else 0.7,
                label="%s (%s)" % (scorer, sample))

    best_index = np.nonzero(results['rank_test_%s' % scorer] == 1)[0][0]
    best_score = results['mean_test_%s' % scorer][best_index]

    # Plot a dotted vertical line at the best score for that scorer marked by x
    ax.plot([X_axis[best_index], ] * 2, [0, best_score],
            linestyle='-.', color=color, marker='x', markeredgewidth=3, ms=8)

    # Annotate the best score for that scorer
    ax.annotate("%0.2f" % best_score,
                (X_axis[best_index], best_score + 0.005))

plt.legend(loc="best")
plt.grid(False)
plt.show()
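One likely reason only a single curve is visible: for single-label classification, support-weighted recall is arithmetically identical to accuracy, so the two curves would lie exactly on top of each other. The sketch below checks this on made-up labels (the label arrays and class proportions are purely illustrative, not taken from the question's data). If per-class recall matters on imbalanced data, `'recall_macro'` may be a more informative scorer.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels for illustration only: an imbalanced three-class problem.
rng = np.random.default_rng(0)
y_true = rng.choice([0, 1, 2], size=200, p=[0.7, 0.2, 0.1])
y_pred = rng.choice([0, 1, 2], size=200, p=[0.6, 0.3, 0.1])

acc = accuracy_score(y_true, y_pred)
rec_weighted = recall_score(y_true, y_pred, average="weighted")
rec_macro = recall_score(y_true, y_pred, average="macro")

# Weighted recall = sum_k (n_k / N) * (TP_k / n_k) = sum_k TP_k / N = accuracy,
# where n_k is the number of true samples of class k and N the total.
print("accuracy:        %.4f" % acc)
print("recall_weighted: %.4f" % rec_weighted)
print("recall_macro:    %.4f" % rec_macro)
```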

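Answer:

GridSearchCV supports multi-metric evaluation: pass a dict of scorers as scoring, and cv_results_ then holds the mean and standard deviation of every metric across the folds.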

# Author: Raghav RV <[email protected]>
# License: BSD

import numpy as np
from matplotlib import pyplot as plt

from sklearn.datasets import make_hastie_10_2
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

Running GridSearchCV with multiple evaluation metrics

X, y = make_hastie_10_2(n_samples=8000, random_state=42)

# The scorers can be either one of the predefined metric strings or a scorer
# callable, like the one returned by make_scorer
scoring = {'AUC': 'roc_auc', 'Accuracy': make_scorer(accuracy_score)}

# Setting refit='AUC' refits an estimator on the whole dataset with the
# parameter setting that has the best cross-validated AUC score.
# That estimator is made available at ``gs.best_estimator_`` along with
# attributes like ``gs.best_score_``, ``gs.best_params_`` and
# ``gs.best_index_``.
# return_train_score=True keeps the train scores in cv_results_
# (recent scikit-learn versions leave them out by default).
gs = GridSearchCV(DecisionTreeClassifier(random_state=42),
                  param_grid={'min_samples_split': range(2, 403, 10)},
                  scoring=scoring, cv=5, refit='AUC',
                  return_train_score=True)
gs.fit(X, y)
results = gs.cv_results_
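To get at the averaged metrics without plotting, the relevant `cv_results_` keys follow the pattern `mean_test_<name>` and `std_test_<name>`, where `<name>` is a key of the `scoring` dict. The following is a minimal sketch of reading them; the smaller sample size and the three-value grid are illustrative choices so it runs quickly, not values from the example above.

```python
from sklearn.datasets import make_hastie_10_2
from sklearn.metrics import accuracy_score, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Smaller sample and grid than in the example above (illustrative values).
X, y = make_hastie_10_2(n_samples=1000, random_state=42)
scoring = {"AUC": "roc_auc", "Accuracy": make_scorer(accuracy_score)}

gs = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid={"min_samples_split": [2, 50, 100]},
    scoring=scoring,
    cv=5,
    refit="AUC",
)
gs.fit(X, y)
results = gs.cv_results_

for name in scoring:
    # One entry per parameter candidate, averaged over the 5 folds
    means = results["mean_test_%s" % name]
    stds = results["std_test_%s" % name]
    for params, mean, std in zip(results["params"], means, stds):
        print("%s %s: %.3f +/- %.3f" % (name, params, mean, std))

# best_index_ and best_score_ refer to the metric named in refit
best_auc = results["mean_test_AUC"][gs.best_index_]
print("best AUC:", best_auc, "with", gs.best_params_)
```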

Plotting the result

plt.figure(figsize=(13, 13))
plt.title("GridSearchCV evaluating using multiple scorers simultaneously",
          fontsize=16)
plt.xlabel("min_samples_split")
plt.ylabel("Score")

# Reuse the current axes so the title and labels set above are kept
ax = plt.gca()
ax.set_xlim(0, 402)
ax.set_ylim(0.73, 1)

# Get the regular numpy array from the MaskedArray
X_axis = np.array(results['param_min_samples_split'].data, dtype=float)

for scorer, color in zip(sorted(scoring), ['g', 'k']):
    # The mean_train_* keys exist only if the search kept the train scores
    for sample, style in (('train', '--'), ('test', '-')):
        sample_score_mean = results['mean_%s_%s' % (sample, scorer)]
        sample_score_std = results['std_%s_%s' % (sample, scorer)]
        ax.fill_between(X_axis, sample_score_mean - sample_score_std,
                        sample_score_mean + sample_score_std,
                        alpha=0.1 if sample == 'test' else 0, color=color)
        ax.plot(X_axis, sample_score_mean, style, color=color,
                alpha=1 if sample == 'test' else 0.7,
                label="%s (%s)" % (scorer, sample))

    best_index = np.nonzero(results['rank_test_%s' % scorer] == 1)[0][0]
    best_score = results['mean_test_%s' % scorer][best_index]

    # Plot a dotted vertical line at the best score for that scorer marked by x
    ax.plot([X_axis[best_index], ] * 2, [0, best_score],
            linestyle='-.', color=color, marker='x', markeredgewidth=3, ms=8)

    # Annotate the best score for that scorer
    ax.annotate("%0.2f" % best_score,
                (X_axis[best_index], best_score + 0.005))

plt.legend(loc="best")
plt.grid(False)
plt.show()

Result:

[Figure: the plot produced by the code above; image not preserved]
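If no hyperparameter search is needed, `cross_validate` may be the simpler route to what the question asks for: it accepts a list of scorers together with a `StratifiedKFold` splitter and returns the per-fold scores, which can then be averaged. The sketch below uses a synthetic stand-in dataset, macro-averaged metrics, and `n_estimators=50`; all three are assumptions made for illustration. Since scorers return a single number, the per-class breakdown with support is obtained here from out-of-fold predictions via `cross_val_predict`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import (StratifiedKFold, cross_val_predict,
                                     cross_validate)

# Synthetic stand-in for the small, imbalanced dataset (illustrative only)
X, y = make_classification(n_samples=300, n_features=10,
                           weights=[0.8, 0.2], random_state=42)

skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
clf = GradientBoostingClassifier(n_estimators=50, random_state=42)
scoring = ["accuracy", "precision_macro", "recall_macro", "f1_macro"]

# One array of per-fold scores per metric, stored under "test_<metric>"
scores = cross_validate(clf, X, y, cv=skf, scoring=scoring)
mean_scores = {name: np.mean(scores["test_%s" % name]) for name in scoring}
for name, value in mean_scores.items():
    print("%s: %.2f" % (name, value))

# Per-class precision, recall, F-score and support, computed on the
# out-of-fold predictions pooled over all three folds
y_pred = cross_val_predict(clf, X, y, cv=skf)
print(classification_report(y, y_pred))
```

Note that the pooled report is not identical to averaging per-fold reports, although the two are usually close.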
