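I have a simple sklearn class that I would like to use as part of an sklearn pipeline. The class takes a pandas DataFrame X_DF and the name of a categorical column, then calls pd.get_dummies to return a DataFrame in which that column has been converted into a matrix of dummy variables…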
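Since the class itself is not reproduced here, the following is only a minimal sketch of what a transformer of this kind might look like. The name DummyVarEncoderSketch and the toy frame are hypothetical, and the choice to return the original column next to its dummy columns is an assumption based on the output shown further down. The constructor argument is stored under an attribute of the same name, which is the convention scikit-learn estimators are expected to follow.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class DummyVarEncoderSketch(TransformerMixin, BaseEstimator):
    """Hypothetical stand-in for the encoder described above (not the original class)."""

    def __init__(self, column_to_dummy=None):
        # Keep the argument under an attribute with the same name as the parameter.
        self.column_to_dummy = column_to_dummy

    def fit(self, X_DF, y=None):
        # Nothing is learned from the data in this sketch.
        return self

    def transform(self, X_DF):
        column = self.column_to_dummy
        # One 0/1 indicator column per distinct value, prefixed with the column name.
        dummy_matrix = pd.get_dummies(X_DF[column], prefix=column, dtype=int)
        # Return the original column side by side with its dummy columns.
        return pd.concat([X_DF[column], dummy_matrix], axis=1)


# Toy data, made up for illustration only.
demo = pd.DataFrame({'size': ['small', 'large', 'small'],
                     'value': [1.0, 2.0, 3.0]})
demo_out = DummyVarEncoderSketch(column_to_dummy='size').fit(demo).transform(demo)
print(demo_out)
```

With plain string values, pd.get_dummies orders the new columns alphabetically; with a categorical column such as the one produced by pd.cut, it follows the category order instead.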
Now, when I fit/transform with this transformer on its own, I get the expected output. For some toy data such as the following:
    import pandas as pd
    from sklearn import datasets

    # Load toy data
    iris = datasets.load_iris()
    X = pd.DataFrame(iris.data, columns = iris.feature_names)
    y = pd.Series(iris.target, name='y')

    # Create arbitrary categorical features
    X['category_1'] = pd.cut(X['sepal length (cm)'], bins=3, labels=['small', 'medium', 'large'])
    X['category_2'] = pd.cut(X['sepal width (cm)'], bins=3, labels=['small', 'medium', 'large'])
…my dummy encoder produces the correct output:
    encoder = dummy_var_encoder(column_to_dummy = 'category_1')
    encoder.fit(X)
    encoder.transform(X).iloc[15:21,:]

    category_1
       category_1  category_1_small  category_1_medium  category_1_large
    15     medium                 0                  1                 0
    16      small                 1                  0                 0
    17      small                 1                  0                 0
    18     medium                 0                  1                 0
    19      small                 1                  0                 0
    20      small                 1                  0                 0
However, when I call the same transformer from an sklearn pipeline defined as follows:
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline
    from sklearn.model_selection import KFold, GridSearchCV

    # Define the pipeline
    clf = LogisticRegression(penalty='l1')
    pipeline_steps = [('dummy_vars', dummy_var_encoder()),
                      ('clf', clf)
                      ]
    pipeline = Pipeline(pipeline_steps)

    # Define the hyperparameters to try for the dummy encoder and the classifier
    # Fit 4 models: try dummying category_1 vs. category_2, and l1 vs. l2 penalty in the logistic regression
    param_grid = {'dummy_vars__column_to_dummy': ['category_1', 'category_2'],
                  'clf__penalty': ['l1', 'l2']
                  }

    # Define the full model search process
    cv_model_search = GridSearchCV(pipeline,
                                   param_grid,
                                   scoring='accuracy',
                                   cv = KFold(),
                                   refit=True,
                                   verbose = 3)
Everything is fine until I fit the pipeline, at which point the dummy encoder throws an error:
    cv_model_search.fit(X,y=y)

    In [101]: cv_model_search.fit(X,y=y)
    Fitting 3 folds for each of 4 candidates, totalling 12 fits
    None
    None
    None
    None
    [CV] dummy_vars__column_to_dummy=category_1, clf__penalty=l1 ………
    Traceback (most recent call last):
      File "", line 1, in
        cv_model_search.fit(X,y=y)
      File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/sklearn/model_selection/_search.py", line 638, in fit
        cv.split(X, y, groups)))
      File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 779, in __call__
        while self.dispatch_one_batch(iterator):
      File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 625, in dispatch_one_batch
        self._dispatch(tasks)
      File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 588, in _dispatch
        job = self._backend.apply_async(batch, callback=cb)
      File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/sklearn/externals/joblib/_parallel_backends.py", line 111, in apply_async
        result = ImmediateResult(func)
      File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/sklearn/externals/joblib/_parallel_backends.py", line 332, in __init__
        self.results = batch()
      File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 131, in __call__
        return [func(*args, **kwargs) for func, args, kwargs in self.items]
      File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/sklearn/model_selection/_validation.py", line 437, in _fit_and_score
        estimator.fit(X_train, y_train, **fit_params)
      File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/sklearn/pipeline.py", line 257, in fit
        Xt, fit_params = self._fit(X, y, **fit_params)
      File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/sklearn/pipeline.py", line 222, in _fit
        **fit_params_steps[name])
      File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/sklearn/externals/joblib/memory.py", line 362, in __call__
        return self.func(*args, **kwargs)
      File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/sklearn/pipeline.py", line 589, in _fit_transform_one
        res = transformer.fit_transform(X, y, **fit_params)
      File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/sklearn/base.py", line 521, in fit_transform
        return self.fit(X, y, **fit_params).transform(X)
      File "", line 21, in transform
        dummy_matrix = pd.get_dummies(X_DF[column], prefix=column)
      File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/pandas/core/frame.py", line 1964, in __getitem__
        return self._getitem_column(key)
      File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/pandas/core/frame.py", line 1971, in _getitem_column
        return self._get_item_cache(key)
      File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/pandas/core/generic.py", line 1645, in _get_item_cache
        values = self._data.get(item)
      File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/pandas/core/internals.py", line 3599, in get
        raise ValueError("cannot label index with a null key")
    ValueError: cannot label index with a null key
Answer:
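The None values printed before the traceback, together with "cannot label index with a null key", suggest that the column name was None by the time transform ran. One possible explanation, assuming the class (whose code is not shown) stores its constructor argument under an attribute whose name differs from the parameter name column_to_dummy: GridSearchCV clones the pipeline and applies each candidate through set_params, and both steps rely on scikit-learn's convention that every __init__ parameter is kept in an attribute of the same name. The sketch below uses a hypothetical ColumnDummyEncoder and made-up toy data, not the original class, to show that convention working when a grid-style parameter is routed through a pipeline.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.pipeline import Pipeline


class ColumnDummyEncoder(TransformerMixin, BaseEstimator):
    """Hypothetical encoder used only to illustrate parameter routing."""

    def __init__(self, column_to_dummy=None):
        # Same name as the __init__ parameter, so get_params, set_params
        # and clone can all find and update it.
        self.column_to_dummy = column_to_dummy

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Read the parameter at transform time, after set_params has run.
        column = self.column_to_dummy
        return pd.get_dummies(X[column], prefix=column, dtype=int)


pipe = Pipeline([('dummy_vars', ColumnDummyEncoder())])

# Roughly what a grid search does for each candidate:
# clone the estimator, then set the candidate's parameters on the clone.
candidate = clone(pipe).set_params(dummy_vars__column_to_dummy='category_1')

toy = pd.DataFrame({'category_1': ['small', 'medium', 'small']})
encoded = candidate.fit_transform(toy)
print(candidate.get_params()['dummy_vars__column_to_dummy'])
print(encoded)
```

If the attribute were stored under another name, set_params would create or update an attribute called column_to_dummy while transform kept reading the stale one, which would be consistent with the None seen in the output. Whether that is what happens in the original class cannot be confirmed without its source.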