Custom sklearn transformer works alone, but throws an error when used in a pipeline

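I have a simple sklearn class that I would like to use as part of an sklearn pipeline. The class takes a pandas DataFrame X_DF and a categorical column name, and calls pd.get_dummies to return a DataFrame with that column expanded into a matrix of dummy variables…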
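The class definition itself is not reproduced in this post. As a rough illustration only, a transformer of the kind described might look like the sketch below. The class name `DummyVarEncoder`, the `None` default, and the choice to return only the dummy matrix are assumptions; the `pd.get_dummies(..., prefix=column)` call mirrors the line that appears in the traceback further down.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class DummyVarEncoder(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for the dummy_var_encoder described above."""

    def __init__(self, column_to_dummy=None):
        # Store the argument under the same name as the constructor
        # parameter so get_params / set_params / clone can find it.
        self.column_to_dummy = column_to_dummy

    def fit(self, X_DF, y=None):
        # Nothing is learned from the data; fit only returns self.
        return self

    def transform(self, X_DF):
        column = self.column_to_dummy
        # Assumption: return only the dummy matrix for the chosen column.
        return pd.get_dummies(X_DF[column], prefix=column)


# Small usage example on made-up data.
toy = pd.DataFrame({"size": ["small", "large", "small"], "v": [1.0, 2.0, 3.0]})
print(DummyVarEncoder(column_to_dummy="size").fit(toy).transform(toy))
```

Keeping the attribute name identical to the `__init__` parameter follows the scikit-learn estimator convention, which matters once the transformer is cloned or tuned inside a pipeline.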

When I fit/transform with this transformer on its own, I get the expected output. For some toy data such as the following:

import pandas as pd
from sklearn import datasets

# Load toy data
iris = datasets.load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.Series(iris.target, name='y')

# Create arbitrary categorical features
X['category_1'] = pd.cut(X['sepal length (cm)'],
                         bins=3,
                         labels=['small', 'medium', 'large'])
X['category_2'] = pd.cut(X['sepal width (cm)'],
                         bins=3,
                         labels=['small', 'medium', 'large'])

…my dummy encoder produces the correct output:

encoder = dummy_var_encoder(column_to_dummy = 'category_1')
encoder.fit(X)
encoder.transform(X).iloc[15:21,:]

category_1 category_1  category_1_small  category_1_medium  category_1_large
15             medium                 0                  1                 0
16              small                 1                  0                 0
17              small                 1                  0                 0
18             medium                 0                  1                 0
19              small                 1                  0                 0
20              small                 1                  0                 0

However, when I call the same transformer from an sklearn pipeline defined as follows:

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import KFold, GridSearchCV

# Define the pipeline
clf = LogisticRegression(penalty='l1')
pipeline_steps = [('dummy_vars', dummy_var_encoder()),
                  ('clf', clf)
                  ]
pipeline = Pipeline(pipeline_steps)

# Define the hyperparameters to try for the dummy encoder and the classifier
# Fit 4 models: try dummying category_1 vs category_2, and l1 vs l2 penalty in the logistic regression
param_grid = {'dummy_vars__column_to_dummy': ['category_1', 'category_2'],
              'clf__penalty': ['l1', 'l2']
              }

# Define the full model search process
cv_model_search = GridSearchCV(pipeline,
                               param_grid,
                               scoring='accuracy',
                               cv = KFold(),
                               refit=True,
                               verbose = 3)

Everything is fine until I fit the pipeline, at which point the dummy encoder throws an error:

cv_model_search.fit(X,y=y)

In [101]: cv_model_search.fit(X,y=y)
Fitting 3 folds for each of 4 candidates, totalling 12 fits
None None None None
[CV] dummy_vars__column_to_dummy=category_1, clf__penalty=l1 ………

Traceback (most recent call last):
  File "", line 1, in <module>
    cv_model_search.fit(X,y=y)
  File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/sklearn/model_selection/_search.py", line 638, in fit
    cv.split(X, y, groups)))
  File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 779, in __call__
    while self.dispatch_one_batch(iterator):
  File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 625, in dispatch_one_batch
    self._dispatch(tasks)
  File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 588, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/sklearn/externals/joblib/_parallel_backends.py", line 111, in apply_async
    result = ImmediateResult(func)
  File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/sklearn/externals/joblib/_parallel_backends.py", line 332, in __init__
    self.results = batch()
  File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 131, in __call__
    return [func(*args, **kwargs) for func, args, kwargs in self.items]
  File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/sklearn/model_selection/_validation.py", line 437, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/sklearn/pipeline.py", line 257, in fit
    Xt, fit_params = self._fit(X, y, **fit_params)
  File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/sklearn/pipeline.py", line 222, in _fit
    **fit_params_steps[name])
  File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/sklearn/externals/joblib/memory.py", line 362, in __call__
    return self.func(*args, **kwargs)
  File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/sklearn/pipeline.py", line 589, in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)
  File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/sklearn/base.py", line 521, in fit_transform
    return self.fit(X, y, **fit_params).transform(X)
  File "", line 21, in transform
    dummy_matrix = pd.get_dummies(X_DF[column], prefix=column)
  File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/pandas/core/frame.py", line 1964, in __getitem__
    return self._getitem_column(key)
  File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/pandas/core/frame.py", line 1971, in _getitem_column
    return self._get_item_cache(key)
  File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/pandas/core/generic.py", line 1645, in _get_item_cache
    values = self._data.get(item)
  File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/pandas/core/internals.py", line 3599, in get
    raise ValueError("cannot label index with a null key")
ValueError: cannot label index with a null key


Answer:
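Since the transformer's source is not shown, the following is only a guess at the cause. The `None` lines in the log and the "null key" error would be consistent with the column name being `None` by the time `transform` runs. One common way for that to happen is that `__init__` stores its argument under an attribute whose name differs from the parameter name. GridSearchCV applies `column_to_dummy` through `set_params` and `clone`, which look the value up by parameter name, so the value the transformer actually reads may never be updated. The sketch below uses two hypothetical classes, `MismatchedEncoder` and `MatchedEncoder`, and a hypothetical helper `param_round_trips` to check whether a parameter survives that round trip.

```python
from sklearn.base import BaseEstimator, TransformerMixin, clone


class MismatchedEncoder(BaseEstimator, TransformerMixin):
    """Hypothetical: stores the argument under a different attribute name."""

    def __init__(self, column_to_dummy=None):
        self.column = column_to_dummy  # name differs from the parameter

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X


class MatchedEncoder(BaseEstimator, TransformerMixin):
    """Hypothetical: attribute name equals the constructor parameter name."""

    def __init__(self, column_to_dummy=None):
        self.column_to_dummy = column_to_dummy

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X


def param_round_trips(estimator, name, value):
    """Return True if a value set via set_params survives clone()."""
    try:
        estimator.set_params(**{name: value})
        return clone(estimator).get_params()[name] == value
    except (AttributeError, RuntimeError):
        # Depending on the scikit-learn version, the mismatch may surface
        # as an AttributeError in get_params or a RuntimeError in clone.
        return False


print(param_round_trips(MatchedEncoder(), "column_to_dummy", "category_1"))
print(param_round_trips(MismatchedEncoder(), "column_to_dummy", "category_1"))
```

If the real `dummy_var_encoder` fails a check like this, renaming the attribute to match the `__init__` parameter (and reading that same attribute in `transform`) would be the first thing to try.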
