I am working with a dataset called TelcoSigtel, which has 5000 observations, 21 features, and an imbalanced target variable with 86% non-churners and 16% churners.
Apologies, I wanted to show an extract of the dataframe, but it is too large, and when I try to pick out a small chunk there are not enough churners in it.
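(A small editorial aside, not part of the original post: one way to display a compact extract that still contains churners is to sample a few rows per class; a minimal sketch, assuming the telcom frame and its churn column used below.)

# hypothetical snippet: show five rows from each churn class
sample = (telcom.groupby("churn", group_keys=False)
                .apply(lambda g: g.sample(5, random_state=0)))
print(sample)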
My question is that the two approaches below should give the same results, but in practice the results differ significantly for some algorithms while being exactly identical for others.
Information about the setup and the code:
from sklearn.linear_model import LogisticRegression
import matplotlib.pyplot as plt

models = [('logit', LogisticRegression(C=1.0, class_weight=None, dual=False,
                                       fit_intercept=True, intercept_scaling=1,
                                       l1_ratio=None, max_iter=600,
                                       multi_class='ovr', n_jobs=1, penalty='l2',
                                       random_state=None, solver='liblinear',
                                       tol=0.0001, verbose=0, warm_start=False)),
          ....]

# Approach 1:
from sklearn import model_selection
from sklearn.model_selection import KFold

X = telcom.drop("churn", axis=1)
Y = telcom["churn"]
results = []
names = []
seed = 0
scoring = "roc_auc"

for name, model in models:
    kfold = model_selection.KFold(n_splits=5, random_state=seed)
    cv_results = model_selection.cross_val_score(model, X, Y, cv=kfold,
                                                 scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)

# Boxplot for algorithm comparison
fig = plt.figure()
fig.suptitle('Algorithm Comparison-AUC')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.grid()
plt.show()
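(An editorial aside, not part of the original question: the built-in "roc_auc" scorer that Approach 1 passes to cross_val_score never calls model.predict. On an older scikit-learn, where needs_threshold is still the relevant flag, you can inspect it directly.)

from sklearn.metrics import get_scorer

scorer = get_scorer("roc_auc")
print(scorer)
# prints something like: make_scorer(roc_auc_score, needs_threshold=True),
# i.e. it scores decision_function / predict_proba output rather than predict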
# Approach 2:
from sklearn.model_selection import KFold
from imblearn.over_sampling import SMOTE
from sklearn.metrics import roc_auc_score
import numpy as np

kf = KFold(n_splits=5, random_state=0)
X = telcom.drop("churn", axis=1)
Y = telcom["churn"]
results = []
names = []
to_store1 = list()
seed = 0
scoring = "roc_auc"
cv_results = np.array([])

for name, model in models:
    for train_index, test_index in kf.split(X):
        # split the data
        X_train, X_test = X.loc[train_index, :].values, X.loc[test_index, :].values
        y_train, y_test = np.ravel(Y[train_index]), np.ravel(Y[test_index])
        model = model  # choose the model
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        to_store1.append(train_index)  # store the fold's indices
        result = roc_auc_score(y_test, y_pred)
        cv_results = np.append(cv_results, result)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)
    cv_results = np.array([])

# Boxplot for algorithm comparison
fig = plt.figure()
fig.suptitle('Algorithm Comparison-AUC')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.grid()
plt.show()
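(Another editorial aside, not part of the original question: with a target this imbalanced, plain KFold can leave some folds with very few churners, which by itself makes fold-level scores noisy; StratifiedKFold preserves the class ratio in every fold. A minimal, self-contained sketch with toy data:)

import numpy as np
from sklearn.model_selection import StratifiedKFold

# toy stand-in for the real churn column (hypothetical data)
rng = np.random.RandomState(0)
X_toy = rng.rand(100, 3)
y_toy = np.array([0] * 85 + [1] * 15)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_index, test_index in skf.split(X_toy, y_toy):
    # every test fold keeps roughly the original class ratio
    print(np.bincount(y_toy[test_index]))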
Answer:
The short answer is that you should use model.predict_proba(X_test)[:, 1] or model.decision_function(X_test) to get the same results, because the roc auc scorer needs class probabilities (or decision scores), not hard class predictions. The long answer is that you can reproduce the same behaviour with a toy example:
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import roc_auc_score, make_scorer

def assert_equal_scores(rnd_seed, needs_threshold):
    """Assert that two different ways of scoring return the same result."""
    X, y, *_ = load_breast_cancer().values()
    kfold = KFold(random_state=rnd_seed)
    lr = LogisticRegression(random_state=rnd_seed + 10)
    roc_auc_scorer = make_scorer(roc_auc_score, needs_threshold=needs_threshold)
    cv_scores1 = cross_val_score(lr, X, y, cv=kfold, scoring=roc_auc_scorer)
    cv_scores2 = cross_val_score(lr, X, y, cv=kfold, scoring='roc_auc')
    np.testing.assert_equal(cv_scores1, cv_scores2)
Try assert_equal_scores(10, False) and assert_equal_scores(10, True) (or any other random seed). The first call raises an AssertionError. The difference is that the roc auc scorer needs the needs_threshold parameter set to True.
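To make the difference concrete, here is a small illustration of my own (not from the original answer): AUC computed on continuous scores uses the full ranking of the test points, while thresholded 0/1 predictions collapse that ranking into just two values.

import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.6, 0.4, 0.9])   # predict_proba-style outputs
labels = (scores >= 0.5).astype(int)      # predict-style hard labels

print(roc_auc_score(y_true, scores))  # 0.75 -- ranks all four points
print(roc_auc_score(y_true, labels))  # 0.5  -- the ranking information is gone

Applied to Approach 2 above, the fix amounts to replacing y_pred = model.predict(X_test) with y_pred = model.predict_proba(X_test)[:, 1] (or model.decision_function(X_test)) before calling roc_auc_score.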