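I want to use StackingClassifier and VotingClassifier together with StratifiedKFold and cross_val_score. When I use StackingClassifier or VotingClassifier, cross_val_score returns nan values. If I substitute any other algorithm for StackingClassifier or VotingClassifier, cross_val_score works fine. I am using Python 3.8.5 and sklearn 0.23.2.
Update: the code is now a working example. Please use the Parkinsons dataset from Kaggle (Parkinsons Dataset). This is the dataset I have been working with, and below are the exact steps I followed.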
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn import preprocessing
from sklearn import metrics
from sklearn import model_selection
from sklearn import feature_selection
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import StackingClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.ensemble import RandomForestClassifier
import warnings

warnings.filterwarnings('ignore')

dataset = pd.read_csv('parkinsons.csv')
FS_X=dataset.iloc[:,:-1]
FS_y=dataset.iloc[:,-1:]
FS_X.drop(['name'],axis=1,inplace=True)

select_k_best = feature_selection.SelectKBest(score_func=feature_selection.f_classif,k=15)
X_k_best = select_k_best.fit_transform(FS_X,FS_y)
supportList = select_k_best.get_support().tolist()
p_valuesList = select_k_best.pvalues_.tolist()

toDrop=[]
for i in np.arange(len(FS_X.columns)):
    bool = supportList[i]
    if(bool == False):
        toDrop.append(FS_X.columns[i])
FS_X.drop(toDrop,axis=1,inplace=True)

smote = SMOTE(random_state=7)
Balanced_X,Balanced_y = smote.fit_sample(FS_X,FS_y)
before = pd.merge(FS_X,FS_y,right_index=True, left_index=True)
after = pd.merge(Balanced_X,Balanced_y,right_index=True, left_index=True)
b=before['status'].value_counts()
a=after['status'].value_counts()
print('Before')
print(b)
print('After')
print(a)

SkFold = model_selection.StratifiedKFold(n_splits=10, random_state=7, shuffle=False)

estimators_list = list()
KNN = KNeighborsClassifier()
RF = RandomForestClassifier(criterion='entropy',random_state = 1)
DT = DecisionTreeClassifier(criterion='entropy',random_state = 1)
GNB = GaussianNB()
LR = LogisticRegression(random_state = 1)
estimators_list.append(LR)
estimators_list.append(RF)
estimators_list.append(DT)
estimators_list.append(GNB)

SCLF = StackingClassifier(estimators = estimators_list,final_estimator = KNN,stack_method = 'predict_proba',cv=SkFold,n_jobs = -1)
VCLF = VotingClassifier(estimators = estimators_list,voting = 'soft',n_jobs = -1)

scores1 = model_selection.cross_val_score(estimator = SCLF,X=Balanced_X.values,y=Balanced_y.values,scoring='accuracy',cv=SkFold)
print('StackingClassifier Scores',scores1)
scores2 = model_selection.cross_val_score(estimator = VCLF,X=Balanced_X.values,y=Balanced_y.values,scoring='accuracy',cv=SkFold)
print('VotingClassifier Scores',scores2)
scores3 = model_selection.cross_val_score(estimator = DT,X=Balanced_X.values,y=Balanced_y.values,scoring='accuracy',cv=SkFold)
print('DecisionTreeClassifier Scores',scores3)
Output:
Before
1    147
0     48
Name: status, dtype: int64
After
1    147
0    147
Name: status, dtype: int64
StackingClassifier Scores [nan nan nan nan nan nan nan nan nan nan]
VotingClassifier Scores [nan nan nan nan nan nan nan nan nan nan]
DecisionTreeClassifier Scores [0.86666667 0.9 0.93333333 0.86666667 0.96551724 0.82758621 0.75862069 0.86206897 0.86206897 0.93103448]
I looked at some related posts on Stack Overflow but could not solve my problem. I cannot figure out what I am doing wrong.
Answer:
The estimators_list passed to StackingClassifier or VotingClassifier is incorrect. According to the sklearn documentation for StackingClassifier:
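Base estimators which will be stacked together. Each element of the list is defined as a tuple of string (i.e. name) and an estimator instance. An estimator can be set to 'drop' using set_params.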
So a correct list would look like this:
KNN = KNeighborsClassifier()
DT = DecisionTreeClassifier(criterion="entropy")
GNB = GaussianNB()
estimators_list = [("KNN", KNN), ("DT", DT), ("GNB", GNB)]
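As a side note on why the scores were nan rather than an error: cross_val_score has an error_score parameter that defaults to np.nan, so in the sklearn version from the question a failed fit is recorded as nan and reported only as a warning, which warnings.filterwarnings('ignore') hides. The sketch below is a rough illustration, not the asker's pipeline. It assumes synthetic data from make_classification in place of the Parkinsons CSV, 5 shuffled folds, and the illustrative names bad_estimators and good_estimators. It passes error_score="raise" to surface the underlying exception, then reruns with named tuples.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier, VotingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the Parkinsons data (an assumption for illustration).
X, y = make_classification(n_samples=200, n_features=15, random_state=7)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=7)

# Bare estimators without names, as in the question: a malformed list.
bad_estimators = [DecisionTreeClassifier(random_state=1), GaussianNB()]

# error_score="raise" re-raises the fitting error instead of recording nan.
try:
    cross_val_score(
        VotingClassifier(estimators=bad_estimators, voting="soft"),
        X, y, scoring="accuracy", cv=cv, error_score="raise",
    )
    bad_list_failed = False
except Exception as exc:
    bad_list_failed = True
    print("Fitting failed:", type(exc).__name__, exc)

# The same estimators as (name, estimator) tuples, as the documentation asks.
good_estimators = [
    ("DT", DecisionTreeClassifier(random_state=1)),
    ("GNB", GaussianNB()),
]
voting = VotingClassifier(estimators=good_estimators, voting="soft")
stacking = StackingClassifier(
    estimators=good_estimators,
    final_estimator=KNeighborsClassifier(),
    stack_method="predict_proba",
    cv=5,
)

voting_scores = cross_val_score(voting, X, y, scoring="accuracy", cv=cv)
stacking_scores = cross_val_score(stacking, X, y, scoring="accuracy", cv=cv)
print("VotingClassifier Scores", voting_scores)
print("StackingClassifier Scores", stacking_scores)
```

Leaving error_score="raise" on while debugging, and not silencing warnings globally, makes this kind of configuration mistake visible on the first run.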
Using your Parkinsons data, a complete minimal working example could look like this: