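This is my first question here, so please let me know if anything is off!
I used sklearn to build an ensemble voting classifier with 3 different estimators. I first fit all 3 estimators on the same data by calling est.fit().
Because 2 of the estimators take a long time to fit, this first dataset is small.
Now I would like to fit the third estimator again on different data. Is there a way to do this?
I tried accessing the estimator like this: ens.estimators_[2].fit(X_largedata, y_largedata)
This does not throw an error, but I don't know whether it fits a copy of the estimator or the one actually inside the ensemble.
Calling ens.predict(X_test) afterwards results in the following error (predict works fine if I do not try to refit the third estimator):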
    ValueError                                Traceback (most recent call last)
    <ipython-input-438-65c955f40b01> in <module>
    ----> 1 pred_ens2 = ens.predict(X_test_ens2)
          2 print(ens.score(X_test_ens2, y_test_ens2))
          3 confusion_matrix(pred_ens2, y_test_ens2).ravel()

    ~/jupyter/lexical/lexical_env/lib/python3.7/site-packages/sklearn/ensemble/_voting.py in predict(self, X)
        280         check_is_fitted(self)
        281         if self.voting == 'soft':
    --> 282             maj = np.argmax(self.predict_proba(X), axis=1)
        283
        284         else:  # 'hard' voting

    ~/jupyter/lexical/lexical_env/lib/python3.7/site-packages/sklearn/ensemble/_voting.py in _predict_proba(self, X)
        300         """Predict class probabilities for X in 'soft' voting."""
        301         check_is_fitted(self)
    --> 302         avg = np.average(self._collect_probas(X), axis=0,
        303                          weights=self._weights_not_none)
        304         return avg

    ~/jupyter/lexical/lexical_env/lib/python3.7/site-packages/sklearn/ensemble/_voting.py in _collect_probas(self, X)
        295     def _collect_probas(self, X):
        296         """Collect results from clf.predict calls."""
    --> 297         return np.asarray([clf.predict_proba(X) for clf in self.estimators_])
        298
        299     def _predict_proba(self, X):

    ~/jupyter/lexical/lexical_env/lib/python3.7/site-packages/sklearn/ensemble/_voting.py in <listcomp>(.0)
        295     def _collect_probas(self, X):
        296         """Collect results from clf.predict calls."""
    --> 297         return np.asarray([clf.predict_proba(X) for clf in self.estimators_])
        298
        299     def _predict_proba(self, X):

    ~/jupyter/lexical/lexical_env/lib/python3.7/site-packages/sklearn/utils/metaestimators.py in <lambda>(*args, **kwargs)
        117
        118         # lambda, but not partial, allows help() to work with update_wrapper
    --> 119         out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)
        120         # update the docstring of the returned function
        121         update_wrapper(out, self.fn)

    ~/jupyter/lexical/lexical_env/lib/python3.7/site-packages/sklearn/pipeline.py in predict_proba(self, X)
        461         Xt = X
        462         for _, name, transform in self._iter(with_final=False):
    --> 463             Xt = transform.transform(Xt)
        464         return self.steps[-1][-1].predict_proba(Xt)
        465

    ~/jupyter/lexical/lexical_env/lib/python3.7/site-packages/sklearn/compose/_column_transformer.py in transform(self, X)
        596             if (n_cols_transform >= n_cols_fit and
        597                     any(X.columns[:n_cols_fit] != self._df_columns)):
    --> 598                 raise ValueError('Column ordering must be equal for fit '
        599                                  'and for transform when using the '
        600                                  'remainder keyword')

    ValueError: Column ordering must be equal for fit and for transform when using the remainder keyword
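Edit: I have resolved the error! It was caused by the small dataset having more columns than the large one. That is probably a problem because, when first fit on the small dataset, the transformers are told that those columns will be present (?). Once both datasets had the same columns (and the same column order), it worked. This seems to be the right way to train only one specific estimator, but please let me know if there is a better way or if you think I am wrong.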
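For illustration, the column alignment described in the edit could look roughly like the sketch below. The frames X_small and X_large and their column names are made up for this example; the asker's real data and pipeline are not shown in the question.

```python
import pandas as pd

# Hypothetical data: the small frame carries an extra column and the
# large frame lists the shared columns in a different order.
X_small = pd.DataFrame({"a": [1, 2], "b": [3, 4], "extra": [0, 0]})
X_large = pd.DataFrame({"b": [5, 6, 7], "a": [8, 9, 10]})

# Keep only the columns present in both frames, in the order used by the
# frame that the ensemble is fitted on first.
shared = [c for c in X_small.columns if c in X_large.columns]
X_small_aligned = X_small[shared]
X_large_aligned = X_large[shared]

print(list(X_small_aligned.columns), list(X_large_aligned.columns))
```

The aligned small frame would have to be used for the initial fit of the whole ensemble as well, so that every transformer sees the same columns at fit time and at predict time.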
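Answer:
It looks like the individual classifiers are stored in a list that can be accessed with .estimators_. The entries of this list are classifiers that have a .fit method. So, using logistic regression as an example: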
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import VotingClassifier

    # Two different synthetic datasets with the same number of features
    X1, y1 = make_classification(random_state=1)
    X2, y2 = make_classification(random_state=2)

    clf1 = LogisticRegression(random_state=1)
    clf2 = LogisticRegression(random_state=2)
    clf3 = LogisticRegression(random_state=3)

    voting = VotingClassifier(estimators=[
        ('a', clf1),
        ('b', clf2),
        ('c', clf3),
    ])

    # Fit all estimators on the first dataset
    voting = voting.fit(X1, y1)

    # Refit only the last estimator on the second dataset
    voting.estimators_[-1].fit(X2, y2)

    voting.predict(X2)
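One way to check that the refit really affects the ensemble, and not a copy, is sketched below. It assumes soft voting without weights so the ensemble probabilities can be recomputed by hand; the variable names changed, manual and matches are only for this illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression

X1, y1 = make_classification(random_state=1)
X2, y2 = make_classification(random_state=2)

voting = VotingClassifier(
    estimators=[
        ("a", LogisticRegression(random_state=1)),
        ("b", LogisticRegression(random_state=2)),
        ("c", LogisticRegression(random_state=3)),
    ],
    voting="soft",
)
voting.fit(X1, y1)

# Remember the coefficients of the last fitted member, then refit it in place.
coef_before = voting.estimators_[-1].coef_.copy()
# The ensemble fits its members on label-encoded targets, so encode y2 the
# same way (a no-op here because the labels are already 0 and 1).
voting.estimators_[-1].fit(X2, voting.le_.transform(y2))
changed = not np.allclose(coef_before, voting.estimators_[-1].coef_)

# With unweighted soft voting, the ensemble probabilities are the mean of
# the members' probabilities, so the refitted member must show up here.
manual = np.mean([clf.predict_proba(X2) for clf in voting.estimators_], axis=0)
matches = np.allclose(manual, voting.predict_proba(X2))

print(changed, matches)
```

If the class labels are not already integers starting at 0, refitting a member directly on the raw labels may leave it inconsistent with the other members, which is why the sketch goes through voting.le_.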
Edit: the difference between estimators and estimators_
.estimators
This is a list of tuples of the form (name, estimator):
    for e in voting.estimators:
        print(e)

    ('a', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, l1_ratio=None, max_iter=100, multi_class='warn', n_jobs=None, penalty='l2', random_state=1, solver='warn', tol=0.0001, verbose=0, warm_start=False))
    ('b', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, l1_ratio=None, max_iter=100, multi_class='warn', n_jobs=None, penalty='l2', random_state=2, solver='warn', tol=0.0001, verbose=0, warm_start=False))
    ('c', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, l1_ratio=None, max_iter=100, multi_class='warn', n_jobs=None, penalty='l2', random_state=3, solver='warn', tol=0.0001, verbose=0, warm_start=False))
.estimators_
This is just a list of the estimators, without the names:
    for e in voting.estimators_:
        print(e)

    LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, l1_ratio=None, max_iter=100, multi_class='warn', n_jobs=None, penalty='l2', random_state=1, solver='warn', tol=0.0001, verbose=0, warm_start=False)
    LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, l1_ratio=None, max_iter=100, multi_class='warn', n_jobs=None, penalty='l2', random_state=2, solver='warn', tol=0.0001, verbose=0, warm_start=False)
    LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, l1_ratio=None, max_iter=100, multi_class='warn', n_jobs=None, penalty='l2', random_state=3, solver='warn', tol=0.0001, verbose=0, warm_start=False)
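Interestingly, though, voting.estimators[0][1] == voting.estimators_[0] evaluates to False, so the entries do not seem to be the same objects. This fits with VotingClassifier.fit cloning each estimator before fitting it: .estimators keeps the original, unfitted objects, while .estimators_ holds the fitted clones.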
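The comparison above can be made a bit more explicit with a small sketch. It assumes a one-member ensemble purely to keep the example short, and the variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(random_state=1)
clf = LogisticRegression(random_state=1)
voting = VotingClassifier(estimators=[("a", clf)]).fit(X, y)

original = voting.estimators[0][1]   # the object passed to the constructor
fitted = voting.estimators_[0]       # the object the ensemble predicts with

# Identity check: are the two entries the very same Python object?
same_object = original is fitted
# A fitted LogisticRegression exposes coef_; an unfitted one does not.
original_is_fitted = hasattr(original, "coef_")
fitted_is_fitted = hasattr(fitted, "coef_")
# Fitted members can also be looked up by name via named_estimators_.
by_name_is_fitted = voting.named_estimators_["a"] is fitted

print(same_object, original_is_fitted, fitted_is_fitted, by_name_is_fitted)
```

Looking members up through named_estimators_ avoids relying on list positions when refitting a single member.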
The predict method of the voting classifier uses the .estimators_ list.
See lines 295-323 of the source code.