Custom sklearn transformer works on its own but raises an error when used in a pipeline

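I have a simple sklearn class that I would like to use as part of an sklearn pipeline. The class takes a pandas DataFrame X_DF and a categorical column name, then calls pd.get_dummies to return the DataFrame with that column converted into a matrix of dummy variables…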
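The class definition itself is not shown here, so the following is only a minimal sketch of what a transformer of this kind might look like. The class name DummyVarEncoderSketch and the decision to drop the original column are assumptions made for illustration; only the column_to_dummy parameter and the pd.get_dummies call come from the description and traceback in this post.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class DummyVarEncoderSketch(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for the dummy_var_encoder class described above."""

    def __init__(self, column_to_dummy=None):
        # Stored under the same name as the constructor argument,
        # which is what scikit-learn's estimator conventions expect.
        self.column_to_dummy = column_to_dummy

    def fit(self, X_DF, y=None):
        # Nothing is learned from the data; fit only returns self.
        return self

    def transform(self, X_DF):
        column = self.column_to_dummy
        dummy_matrix = pd.get_dummies(X_DF[column], prefix=column)
        # Assumption: replace the original column with its dummy columns.
        return pd.concat([X_DF.drop(columns=[column]), dummy_matrix], axis=1)
```

Used on its own, such a class would be called as DummyVarEncoderSketch(column_to_dummy='category_1').fit(X).transform(X).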

When I use this transformer on its own to fit/transform, I get the expected output. For some toy data such as:

import pandas as pd
from sklearn import datasets

# Load toy data
iris = datasets.load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.Series(iris.target, name='y')

# Create arbitrary categorical features
X['category_1'] = pd.cut(X['sepal length (cm)'],
                         bins=3,
                         labels=['small', 'medium', 'large'])
X['category_2'] = pd.cut(X['sepal width (cm)'],
                         bins=3,
                         labels=['small', 'medium', 'large'])

…my dummy encoder produces the correct output:

encoder = dummy_var_encoder(column_to_dummy='category_1')
encoder.fit(X)
encoder.transform(X).iloc[15:21, :]

category_1
   category_1  category_1_small  category_1_medium  category_1_large
15     medium                 0                  1                 0
16      small                 1                  0                 0
17      small                 1                  0                 0
18     medium                 0                  1                 0
19      small                 1                  0                 0
20      small                 1                  0                 0

However, when I call the same transformer from an sklearn pipeline defined as follows:

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import KFold, GridSearchCV

# Define the pipeline
clf = LogisticRegression(penalty='l1')
pipeline_steps = [('dummy_vars', dummy_var_encoder()),
                  ('clf', clf)]
pipeline = Pipeline(pipeline_steps)

# Define the hyperparameters to try for the dummy encoder and the classifier.
# This fits 4 models: dummy-encode category_1 or category_2,
# with an l1 or l2 penalty in the logistic regression.
param_grid = {'dummy_vars__column_to_dummy': ['category_1', 'category_2'],
              'clf__penalty': ['l1', 'l2']}

# Define the full model search
cv_model_search = GridSearchCV(pipeline,
                               param_grid,
                               scoring='accuracy',
                               cv=KFold(),
                               refit=True,
                               verbose=3)

Everything works until I fit the pipeline, at which point the dummy encoder raises an error:

cv_model_search.fit(X,y=y)

In [101]: cv_model_search.fit(X,y=y)
Fitting 3 folds for each of 4 candidates, totalling 12 fits

None None None None
[CV] dummy_vars__column_to_dummy=category_1, clf__penalty=l1 ………

Traceback (most recent call last):

  File "", line 1, in <module>
    cv_model_search.fit(X,y=y)

  File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/sklearn/model_selection/_search.py", line 638, in fit
    cv.split(X, y, groups)))

  File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 779, in __call__
    while self.dispatch_one_batch(iterator):

  File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 625, in dispatch_one_batch
    self._dispatch(tasks)

  File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 588, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)

  File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/sklearn/externals/joblib/_parallel_backends.py", line 111, in apply_async
    result = ImmediateResult(func)

  File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/sklearn/externals/joblib/_parallel_backends.py", line 332, in __init__
    self.results = batch()

  File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 131, in __call__
    return [func(*args, **kwargs) for func, args, kwargs in self.items]

  File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/sklearn/model_selection/_validation.py", line 437, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)

  File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/sklearn/pipeline.py", line 257, in fit
    Xt, fit_params = self._fit(X, y, **fit_params)

  File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/sklearn/pipeline.py", line 222, in _fit
    **fit_params_steps[name])

  File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/sklearn/externals/joblib/memory.py", line 362, in __call__
    return self.func(*args, **kwargs)

  File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/sklearn/pipeline.py", line 589, in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)

  File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/sklearn/base.py", line 521, in fit_transform
    return self.fit(X, y, **fit_params).transform(X)

  File "", line 21, in transform
    dummy_matrix = pd.get_dummies(X_DF[column], prefix=column)

  File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/pandas/core/frame.py", line 1964, in __getitem__
    return self._getitem_column(key)

  File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/pandas/core/frame.py", line 1971, in _getitem_column
    return self._get_item_cache(key)

  File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/pandas/core/generic.py", line 1645, in _get_item_cache
    values = self._data.get(item)

  File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/pandas/core/internals.py", line 3599, in get
    raise ValueError("cannot label index with a null key")

ValueError: cannot label index with a null key


Answer:
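The class definition is elided in the question, so what follows is only a guess at the cause, not a confirmed diagnosis. The "None" lines in the log and the "null key" error suggest the column name is None by the time transform runs inside the search. One way that can happen: GridSearchCV clones the pipeline for each fit, clone rebuilds every estimator from get_params(), and get_params() looks up each constructor argument as an attribute of the same name. If __init__ stores column_to_dummy under a different attribute name, the value does not survive cloning. The sketch below illustrates this with two hypothetical classes (MismatchedEncoder and MatchedEncoder are made-up names, as is the helper cloned_value).

```python
from sklearn.base import BaseEstimator, TransformerMixin, clone


class MismatchedEncoder(BaseEstimator, TransformerMixin):
    """Hypothetical: stores the argument under a different attribute name."""

    def __init__(self, column_to_dummy=None):
        self.column = column_to_dummy  # name differs from the argument

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X


class MatchedEncoder(BaseEstimator, TransformerMixin):
    """Hypothetical: attribute name matches the constructor argument."""

    def __init__(self, column_to_dummy=None):
        self.column_to_dummy = column_to_dummy

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X


def cloned_value(encoder, attribute):
    """Return `attribute` as seen on a clone of `encoder`, or None if lost."""
    try:
        return getattr(clone(encoder), attribute)
    except AttributeError:
        # Newer scikit-learn releases raise here instead of returning None,
        # because get_params() cannot find an attribute named column_to_dummy.
        return None


print(cloned_value(MatchedEncoder('category_1'), 'column_to_dummy'))
print(cloned_value(MismatchedEncoder('category_1'), 'column'))
```

If the original dummy_var_encoder does something similar, renaming the attribute so that it matches the constructor argument (self.column_to_dummy = column_to_dummy) and reading it inside transform would be the first thing to try.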
