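I am trying to build a voting classifier for an ensemble out of several sklearn classifiers.
For testing, I have a data frame with a set of columns representing tool skills (values from 0 to 10 indicating how well a person masters that skill) and a "Fit to Job" column as the class variable. For example: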
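    import numpy as np
    import pandas as pd

    columns = ["Python", "Scikit-learn", "Pandas", "Fit to Job"]
    total_mock_samples = 100

    # Fill the data frame with mock data (mockResults is shown in Edit 01).
    # Rows are collected in a list because DataFrame.append was removed in pandas 2.0.
    rows = [
        mockResults(columns, 'Fit to Job', good_values=i > total_mock_samples / 2)
        for i in range(total_mock_samples)
    ]
    df = pd.DataFrame(rows, columns=columns)

    # The output looks like this:
    print(np.array(df))
    # [[1. 3. 6. 1.]
    #  [3. 2. 3. 0.]
    #  [1. 4. 0. 0.]
    #  ...
    #  [7. 8. 8. 1.]
    #  [8. 7. 9. 1.]]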
Then I build my ensemble classifier:
    from sklearn.ensemble import RandomForestClassifier, VotingClassifier
    from sklearn.svm import SVC
    from sklearn.naive_bayes import GaussianNB
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score
    import numpy as np

    X = np.array(df[df.columns[:-1]])
    y = np.array(df[df.columns[-1]])

    rfc = RandomForestClassifier(n_estimators=10)
    svc = SVC(kernel='linear')
    knn = KNeighborsClassifier(n_neighbors=5)
    nb = GaussianNB()
    lr = LinearRegression()

    ensemble = VotingClassifier(estimators=[("Random forest", rfc), ("KNN", knn), ("Naive Bayes", nb), ("SVC", svc), ("Linear Reg.", lr)])
Finally, I try to evaluate it with cross-validation, like this:

    cval_score = cross_val_score(ensemble, X, y, cv=10)

But I get the following error:
    TypeError                                 Traceback (most recent call last)
    <ipython-input-13-f7c01fa872d2> in <module>
        182 ensemble = VotingClassifier(estimators=[("Random forest", rfc), ("KNN",knn), ("Naive Bayes", nb), ("SVC",svc), ("Linear Reg.",lr)])
        183
    --> 184 cval_score = cross_val_score(ensemble, X, y, cv=10)

    [...]

    TypeError: Cannot cast array data from dtype('float64') to dtype('int64') according to the rule 'safe'
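I have looked at other answers, but they all deal with numpy data conversion, whereas my error occurs during cross-validation. I tried applying their solutions without success.
I also tried changing the data types before computing the score, but that did not work either.
Maybe someone with a sharper eye can spot where the problem is.
Edit 01: the mock result generation function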
    import random

    def mockResults(columns, result_column_name='Fit', min_value=0, max_value=10, good_values=False):
        mock_res = {}
        for column in columns:
            mock_res[column] = 0
            if column == result_column_name:
                # Class label: 1.0 for a good candidate, 0.0 otherwise
                if good_values:
                    mock_res[column] = float(1)
                else:
                    mock_res[column] = float(0)
            elif good_values:
                # Good candidates get skill values in the upper range
                mock_res[column] = float(random.randrange(int(max_value * 0.7), max_value))
            else:
                # Other candidates get skill values in the lower half
                mock_res[column] = float(random.randrange(min_value, int(max_value * 0.5)))
        return mock_res
Answer:
    df = pd.DataFrame(columns=["Python", "Scikit-learn", "Pandas", "Fit to Job"],
                      data=np.random.randint(1, 10, size=(400, 4)))

    class LinearRegressionInt(LinearRegression):
        def predict(self, X):
            predictions = self._decision_function(X)
            # Cast the regression output to int64 so the voting step can count labels
            return np.asarray(predictions, dtype=np.int64).ravel()

    ...
    lr = LinearRegressionInt()
    ...
    ensemble = VotingClassifier(estimators=[("lr", lr), ("Random forest", rfc), ("KNN", knn), ("Naive Bayes", nb), ("SVC", svc)])
    cval_score = cross_val_score(ensemble, X, y, cv=10)
    cval_score

    array([ 0.09090909,  0.11904762,  0.17073171,  0.14634146,  0.17073171,
            0.15384615,  0.07692308,  0.15384615,  0.10810811,  0.08108108])
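To see why the cast matters, here is a minimal sketch of what is most likely going on. It assumes that hard voting counts the predicted labels with np.bincount, which only accepts integers. LinearRegression returns floats, so the count fails with the same message as in the question. The arrays int_votes and float_votes are made up for illustration.

```python
import numpy as np

# Label votes from three classifiers for one sample (illustrative values).
int_votes = np.array([0, 1, 1])
counts = np.bincount(int_votes)   # how many votes each label received
winner = int(np.argmax(counts))   # label with the most votes

# A regressor contributes float predictions, so the votes become float64
# and np.bincount refuses to cast them to integers.
float_votes = np.array([0.0, 1.0, 1.0])
try:
    np.bincount(float_votes)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(winner)
print(error_message)
```

Note that casting with dtype=np.int64 truncates toward zero, so a regression output of 0.9 becomes label 0. The scores near 0.1 above are also expected for that data frame, since its target column holds random integers from 1 to 9.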
Reference: Type error with the voting classifier
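As an alternative to subclassing LinearRegression (this is not the approach taken in the answer above), the regressor could be swapped for an actual classifier such as LogisticRegression, so that every estimator predicts class labels. Recent scikit-learn releases may also reject a regressor inside VotingClassifier outright, which would make this the safer route. The data below is a hypothetical stand-in for the mock skills table, and the short estimator names are my own.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Hypothetical stand-in data: low skill values -> class 0, high values -> class 1
X_low = rng.integers(0, 5, size=(50, 3)).astype(float)
X_high = rng.integers(7, 10, size=(50, 3)).astype(float)
X = np.vstack([X_low, X_high])
y = np.array([0] * 50 + [1] * 50)

# Every member is a classifier, so hard voting only ever sees class labels
ensemble = VotingClassifier(estimators=[
    ("rf", RandomForestClassifier(n_estimators=10, random_state=0)),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
    ("nb", GaussianNB()),
    ("svc", SVC(kernel="linear")),
    ("lr", LogisticRegression()),
])

cval_score = cross_val_score(ensemble, X, y, cv=10)
print(cval_score.mean())
```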