sklearn StackingClassifier with pipelines

The setup:

My dataset contains some NaN values.
I want to fit a LogisticRegression model and feed its predictions into a HistGradientBoostingClassifier.
I want the HistGradientBoostingClassifier to use its own internal NaN handling.

First, here is a Debug class to help see what is going on:

    from sklearn.base import BaseEstimator, TransformerMixin
    import numpy as np

    class Debug(BaseEstimator, TransformerMixin):

        def __init__(self, msg='DEBUG'):
            self.msg = msg

        def transform(self, X):
            self.shape = X.shape
            print(self.msg)
            print(f'Shape: {self.shape}')
            print(f'NaN count: {np.count_nonzero(np.isnan(X))}')
            return X

        def fit(self, X, y=None, **fit_params):
            return self

Now my pipeline:

    # The experimental import is only needed on scikit-learn versions before 1.0
    from sklearn.experimental import enable_hist_gradient_boosting
    from sklearn.ensemble import HistGradientBoostingClassifier
    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.ensemble import StackingClassifier
    from sklearn.preprocessing import StandardScaler
    from sklearn.impute import SimpleImputer

    data = load_breast_cancer()
    X = data['data']
    y = data['target']
    X[0, 0] = np.nan   # create a NaN

    lr_pipe = make_pipeline(
        Debug('lr_pipe START'),
        SimpleImputer(),
        StandardScaler(),
        LogisticRegression()
    )

    pipe = StackingClassifier(
        estimators=[('lr_pipe', lr_pipe)],
        final_estimator=HistGradientBoostingClassifier(),
        passthrough=True,
        cv=2,
        verbose=10
    )

    pipe.fit(X, y)

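What should happen:

The LogisticRegression model is fit on the whole dataset for later predictions (not used here).
To generate the features fed into the HGB, LogisticRegression has to go through cross_val_predict, for which I specified 2 folds. I should see lr_pipe called twice to produce the out-of-fold predictions.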

What actually happens:

    lr_pipe START
    Shape: (569, 30)
    NaN count: 1
    lr_pipe START
    Shape: (284, 30)
    NaN count: 0
    lr_pipe START
    Shape: (285, 30)
    NaN count: 1
    lr_pipe START
    Shape: (285, 30)
    NaN count: 1
    lr_pipe START
    Shape: (284, 30)
    NaN count: 0
    [Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
    [Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
    [Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.0s remaining:    0.0s
    [Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.0s finished

Why is lr_pipe called 5 times? I should see it called 3 times.

Answer:

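In fact, lr_pipe's fit() is called 3 times, but its transform() is called 5 times. The Debug class only prints inside transform(), so the output counts transform() calls, not fits. You can confirm this by adding a print() to fit().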

From the StackingClassifier documentation:

Note that estimators_ are fitted on the full X while final_estimator_ is trained using cross-validated predictions of the base estimators using cross_val_predict.

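So transform() is called once when your estimator is fitted on the full X, and another 2*2 times to fit the final_estimator: for each of the 2 folds, once on the training split (while fitting) and once on the validation split (while predicting). That gives 1 + 4 = 5 calls, while fit() runs only 1 + 2 = 3 times.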
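The shapes in the question's output line up with this accounting. As a rough check, the sketch below reproduces the fold sizes, assuming that cv=2 resolves to an unshuffled StratifiedKFold with 2 splits, which is what scikit-learn documents for an integer cv with a classifier. The variable names are illustrative only.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold

X, y = load_breast_cancer(return_X_y=True)

n_folds = 2
splitter = StratifiedKFold(n_splits=n_folds)  # assumed equivalent of cv=2

# (training rows, validation rows) for each fold
sizes = [(len(train_idx), len(test_idx)) for train_idx, test_idx in splitter.split(X, y)]

for i, (n_train, n_test) in enumerate(sizes, start=1):
    print(f"fold {i}: fit on {n_train} rows, predict on {n_test} rows")

# One transform() for the full fit, plus two per fold (train and validation).
expected_transform_calls = 1 + 2 * n_folds
print(f"expected transform() calls: {expected_transform_calls}")
```

The 284 and 285 row counts correspond to the (284, 30) and (285, 30) shapes printed after the initial (569, 30) fit on the full data.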
