model_selection.KFold and kf.split give different results

I am working with a dataset called TelcoSigtel, which has 5,000 observations, 21 features, and an imbalanced target variable: 86% non-churners and 14% churners.
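For context, a quick way to confirm the class balance (a minimal sketch, assuming the dataset is loaded into a pandas DataFrame named `telcom` with a binary `churn` column, as in the code below):

```python
# Assumes `telcom` is already loaded as a pandas DataFrame.
# Relative frequency of each target class: should show the ~86% / ~14% imbalance.
print(telcom["churn"].value_counts(normalize=True))
```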

Sorry, I would have liked to show a slice of the data frame, but it is too large, and when I try to take a small subset it does not contain enough churners.

My problem is that the two approaches below should give the same results, yet in practice the results differ significantly for some algorithms while being exactly identical for others.

Here are the models and the two approaches:

```python
# Method 1:
from sklearn import model_selection
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
import matplotlib.pyplot as plt

models = [('logit',
           LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                              intercept_scaling=1, l1_ratio=None, max_iter=600,
                              multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
                              solver='liblinear', tol=0.0001, verbose=0, warm_start=False)),
          ....]

X = telcom.drop("churn", axis=1)
Y = telcom["churn"]

results = []
names = []
seed = 0
scoring = "roc_auc"

for name, model in models:
    kfold = model_selection.KFold(n_splits=5, random_state=seed)
    cv_results = model_selection.cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)

# Boxplot for algorithm comparison
fig = plt.figure()
fig.suptitle('Algorithm Comparison-AUC')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.grid()
plt.show()
```

[Boxplot: Algorithm Comparison-AUC (Method 1)]

```python
# Method 2:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import KFold
from imblearn.over_sampling import SMOTE
from sklearn.metrics import roc_auc_score

kf = KFold(n_splits=5, random_state=0)
X = telcom.drop("churn", axis=1)
Y = telcom["churn"]

results = []
names = []
to_store1 = list()
seed = 0
scoring = "roc_auc"
cv_results = np.array([])

for name, model in models:
    for train_index, test_index in kf.split(X):
        # Split the data
        X_train, X_test = X.loc[train_index, :].values, X.loc[test_index, :].values
        y_train, y_test = np.ravel(Y[train_index]), np.ravel(Y[test_index])
        model = model  # choose the model
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        to_store1.append(train_index)
        # Store the fold result
        result = roc_auc_score(y_test, y_pred)
        cv_results = np.append(cv_results, result)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)
    cv_results = np.array([])

# Boxplot for algorithm comparison
fig = plt.figure()
fig.suptitle('Algorithm Comparison-AUC')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.grid()
plt.show()
```

[Boxplot: Algorithm Comparison-AUC (Method 2)]


Answer:

The short answer is that you should use `model.predict_proba(X_test)[:, 1]` or `model.decision_function(X_test)` to get the same results, because the ROC AUC scorer needs class probabilities (or decision scores), not hard labels. The long answer is that you can reproduce the same behavior with a toy example:

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import roc_auc_score, make_scorer

def assert_equal_scores(rnd_seed, needs_threshold):
    """Assert that two different scoring methods return the same result."""
    X, y, *_ = load_breast_cancer().values()
    kfold = KFold(random_state=rnd_seed)
    lr = LogisticRegression(random_state=rnd_seed + 10)
    roc_auc_scorer = make_scorer(roc_auc_score, needs_threshold=needs_threshold)
    cv_scores1 = cross_val_score(lr, X, y, cv=kfold, scoring=roc_auc_scorer)
    cv_scores2 = cross_val_score(lr, X, y, cv=kfold, scoring='roc_auc')
    np.testing.assert_equal(cv_scores1, cv_scores2)
```

Try `assert_equal_scores(10, False)` and `assert_equal_scores(10, True)` (or any other random seed). The first one raises an `AssertionError`. The difference is that the ROC AUC scorer requires the `needs_threshold` parameter to be `True`.
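Applied to Method 2, that means scoring each fold on predicted probabilities rather than on the hard labels from `model.predict`. A minimal sketch of the corrected inner loop, assuming `X`, `Y`, `kf`, and `models` are defined as in the question (`model.decision_function(X_test)` would work equally well for estimators that expose it):

```python
for name, model in models:
    fold_scores = []
    for train_index, test_index in kf.split(X):
        X_train, X_test = X.loc[train_index, :].values, X.loc[test_index, :].values
        y_train, y_test = np.ravel(Y[train_index]), np.ravel(Y[test_index])
        model.fit(X_train, y_train)
        # roc_auc_score needs a continuous score to rank samples,
        # so pass the positive-class probability instead of model.predict(X_test)
        y_score = model.predict_proba(X_test)[:, 1]
        fold_scores.append(roc_auc_score(y_test, y_score))
    print("%s: %f (%f)" % (name, np.mean(fold_scores), np.std(fold_scores)))
```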
