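I am working through a guide on using machine learning models to predict a passenger's chance of surviving the sinking of the Titanic.
I am stuck on cell 21. It compares the performance of the 22 machine learning algorithms listed below after splitting the data. The end result should be a table comparing their metrics.
Cell 21: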
    # Machine Learning Algorithm (MLA) selection and initialization
    MLA = [
        # Ensemble methods
        ensemble.AdaBoostClassifier(),
        ensemble.BaggingClassifier(),
        ensemble.ExtraTreesClassifier(),
        ensemble.GradientBoostingClassifier(),
        ensemble.RandomForestClassifier(),

        # Gaussian processes
        gaussian_process.GaussianProcessClassifier(),

        # GLM
        linear_model.LogisticRegressionCV(),
        linear_model.PassiveAggressiveClassifier(),
        linear_model.RidgeClassifierCV(),
        linear_model.SGDClassifier(),
        linear_model.Perceptron(),

        # Naive Bayes
        naive_bayes.BernoulliNB(),
        naive_bayes.GaussianNB(),

        # Nearest neighbors
        neighbors.KNeighborsClassifier(),

        # SVM
        svm.SVC(probability = True),
        svm.NuSVC(probability = True),
        svm.LinearSVC(),

        # Trees
        tree.DecisionTreeClassifier(),
        tree.ExtraTreeClassifier(),

        # Discriminant analysis
        discriminant_analysis.LinearDiscriminantAnalysis(),
        discriminant_analysis.QuadraticDiscriminantAnalysis(),

        # xgboost
        XGBClassifier()
    ]

    # Split the dataset for cross-validation with this splitter class
    # Note: this is an alternative to train_test_split
    cv_split = model_selection.ShuffleSplit(n_splits = 10, test_size = .3, train_size = .6, random_state = 0)  # run each model 10x with a 60/30 split, intentionally leaving out 10%

    # Create a table to compare MLA metrics
    MLA_columns = ['MLA Name', 'MLA Parameters', 'MLA Train Accuracy Mean', 'MLA Test Accuracy Mean', 'MLA Test Accuracy 3*STD', 'MLA Time']
    MLA_compare = pd.DataFrame(columns = MLA_columns)

    # Create a table to compare MLA predictions
    MLA_predict = data1[Target]

    # Loop through the MLAs and save their performance to the table
    row_index = 0
    for alg in MLA:

        # Set name and parameters
        MLA_name = alg.__class__.__name__
        MLA_compare.loc[row_index, 'MLA Name'] = MLA_name
        MLA_compare.loc[row_index, 'MLA Parameters'] = str(alg.get_params())

        # Score the model with cross-validation
        cv_results = model_selection.cross_validate(alg, data1[data1_x_bin], data1[Target], cv = cv_split)

        MLA_compare.loc[row_index, 'MLA Time'] = cv_results['fit_time'].mean()
        print(cv_results.keys())
        MLA_compare.loc[row_index, 'MLA Train Accuracy Mean'] = cv_results['train_score'].mean()
        MLA_compare.loc[row_index, 'MLA Test Accuracy Mean'] = cv_results['test_score'].mean()
        # If this is an unbiased random sample, then the mean +/- 3 standard deviations (std) should statistically capture 99.7% of the subsets
        MLA_compare.loc[row_index, 'MLA Test Accuracy 3*STD'] = cv_results['test_score'].std()*3  # let's know the worst that can happen!

        # Save MLA predictions
        alg.fit(data1[data1_x_bin], data1[Target])
        MLA_predict[MLA_name] = alg.predict(data1[data1_x_bin])

        row_index+=1

    # Print and sort the table
    MLA_compare.sort_values(by = ['MLA Test Accuracy Mean'], ascending = False, inplace = True)
    MLA_compare
    # MLA_predict
When I run it, I get the following error:
    dict_keys(['fit_time', 'score_time', 'test_score'])
    ---------------------------------------------------------------------------
    KeyError                                  Traceback (most recent call last)
    <ipython-input-21-cbe9dc24e1e0> in <module>
         67     MLA_compare.loc[row_index, 'MLA Time'] = cv_results['fit_time'].mean()
         68     print(cv_results.keys())
    ---> 69     MLA_compare.loc[row_index, 'MLA Train Accuracy Mean'] = cv_results['train_score'].mean()
         70     MLA_compare.loc[row_index, 'MLA Test Accuracy Mean'] = cv_results['test_score'].mean()
         71

    KeyError: 'train_score'
As you can see, 'train_score' is not even present in cv_results.keys().
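Answer:
According to the documentation for sklearn.model_selection.cross_validate, the train_score entry is only returned when return_train_score is set to True. The parameter defaults to False (since scikit-learn 0.21), so training scores are not computed unless you ask for them. Change the call as follows: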
    cv_results = model_selection.cross_validate(alg, data1[data1_x_bin], data1[Target], cv = cv_split, return_train_score=True)
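
If you want to check this behaviour outside the notebook, here is a minimal sketch. It assumes scikit-learn 0.21 or later and uses a synthetic dataset from make_classification as a stand-in for data1[data1_x_bin] and data1[Target], which I do not have. The names X, y, default_results and full_results are made up for the illustration. Only the splitter settings are taken from the question.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import ShuffleSplit, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for data1[data1_x_bin] / data1[Target] (an assumption,
# not the Titanic data from the guide).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Same splitter settings as in the question: 10 shuffled 60/30 splits.
cv_split = ShuffleSplit(n_splits=10, test_size=.3, train_size=.6, random_state=0)
alg = DecisionTreeClassifier(random_state=0)

# Default call: return_train_score is False, so no 'train_score' key.
default_results = cross_validate(alg, X, y, cv=cv_split)
print(sorted(default_results.keys()))

# Explicitly request training scores.
full_results = cross_validate(alg, X, y, cv=cv_split, return_train_score=True)
print(sorted(full_results.keys()))

# Both arrays hold one score per split, so .mean() and .std() work as in cell 21.
print(full_results['train_score'].mean(), full_results['test_score'].mean())
```

The first print should list only fit_time, score_time and test_score, matching the dict_keys output in the question, while the second also includes train_score. Computing training scores costs extra time per split, which is presumably why it is off by default.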