Sklearn_pandas in a pipeline returns TypeError: 'builtin_function_or_method' object is not iterable

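I have a data set with categorical and numerical features. I want to apply some transformations to it and then classify with XGBClassifier.

Link to the data set: https://www.kaggle.com/blastchar/telco-customer-churn

Because the numerical and categorical features need different transformations, I used sklearn_pandas and its DataFrameMapper.

To one-hot encode the categorical features I want to use DictVectorizer. But to use DictVectorizer I first need to convert the DataFrame to a dict, which I tried to do with a custom transformer, Dictifier.

When I run the pipeline I get the error 'builtin_function_or_method' object is not iterable. Does anyone know what might be causing it?

Error traceback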

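    /opt/conda/lib/python3.6/site-packages/sklearn/model_selection/_validation.py:542: FutureWarning: From version 0.22, errors during fit will result in a cross validation score of NaN by default. Use error_score='raise' if you want an exception raised or error_score=np.nan to adopt the behavior from version 0.22.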
FutureWarning)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-187-96272018fb87> in <module>()
    53     
    54 # Perform cross-validation
---> 55 cross_val_scores = cross_val_score(pipeline, X, y, scoring="roc_auc", cv=3)
/opt/conda/lib/python3.6/site-packages/sklearn_pandas/cross_validation.py in cross_val_score(model, X, *args, **kwargs)
    19     warnings.warn(DEPRECATION_MSG, DeprecationWarning)
    20     X = DataWrapper(X)
---> 21     return sk_cross_val_score(model, X, *args, **kwargs)
    22     
    23 
/opt/conda/lib/python3.6/site-packages/sklearn/model_selection/_validation.py in cross_val_score(estimator, X, y, groups, scoring, cv, n_jobs, verbose, fit_params, pre_dispatch, error_score)
    400                                 fit_params=fit_params,
    401                                 pre_dispatch=pre_dispatch,
--> 402                                 error_score=error_score)
    403     return cv_results['test_score']
    404 
/opt/conda/lib/python3.6/site-packages/sklearn/model_selection/_validation.py in cross_validate(estimator, X, y, groups, scoring, cv, n_jobs, verbose, fit_params, pre_dispatch, return_train_score, return_estimator, error_score)
    238             return_times=True, return_estimator=return_estimator,
    239             error_score=error_score)
--> 240         for train, test in cv.split(X, y, groups))
    241     
    242     zipped_scores = list(zip(*scores))
/opt/conda/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in __call__(self, iterable)
    979             # remaining jobs.
    980             self._iterating = False
--> 981             if self.dispatch_one_batch(iterator):
    982                 self._iterating = self._original_iterator is not None
    983 
/opt/conda/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in dispatch_one_batch(self, iterator)
    821                 return False
    822             else:
--> 823                 self._dispatch(tasks)
    824                 return True
    825 
/opt/conda/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in _dispatch(self, batch)
    778         with self._lock:
    779             job_idx = len(self._jobs)
--> 780             job = self._backend.apply_async(batch, callback=cb)
    781             # A job can complete so quickly than its callback is
    782             # called before we get here, causing self._jobs to
/opt/conda/lib/python3.6/site-packages/sklearn/externals/joblib/_parallel_backends.py in apply_async(self, func, callback)
    181     def apply_async(self, func, callback=None):
    182         """Schedule a func to be run"""
--> 183         result = ImmediateResult(func)
    184         if callback:
    185             callback(result)
/opt/conda/lib/python3.6/site-packages/sklearn/externals/joblib/_parallel_backends.py in __init__(self, batch)
    541         # Don't delay the application, to avoid keeping the input
    542         # arguments in memory
--> 543         self.results = batch()
    544     
    545     def get(self):
/opt/conda/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in __call__(self)
    259         with parallel_backend(self._backend):
    260             return [func(*args, **kwargs)
--> 261                     for func, args, kwargs in self.items]
    262     
    263     def __len__(self):
/opt/conda/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in <listcomp>(.0)
    259         with parallel_backend(self._backend):
    260             return [func(*args, **kwargs)
--> 261                     for func, args, kwargs in self.items]
    262     
    263     def __len__(self):
/opt/conda/lib/python3.6/site-packages/sklearn/model_selection/_validation.py in _fit_and_score(estimator, X, y, scorer, train, test, verbose, parameters, fit_params, return_train_score, return_parameters, return_n_test_samples, return_times, return_estimator, error_score)
    526             estimator.fit(X_train, **fit_params)
    527         else:
--> 528             estimator.fit(X_train, y_train, **fit_params)
    529     
    530     except Exception as e:
/opt/conda/lib/python3.6/site-packages/sklearn/pipeline.py in fit(self, X, y, **fit_params)
    263             This estimator
    264         """
--> 265         Xt, fit_params = self._fit(X, y, **fit_params)
    266         if self._final_estimator is not None:
    267             self._final_estimator.fit(Xt, y, **fit_params)
/opt/conda/lib/python3.6/site-packages/sklearn/pipeline.py in _fit(self, X, y, **fit_params)
    228                 Xt, fitted_transformer = fit_transform_one_cached(
    229                     cloned_transformer, Xt, y, None,
--> 230                     **fit_params_steps[name])
    231                 # Replace the transformer of the step with the fitted
    232                 # transformer. This is necessary when loading the transformer
/opt/conda/lib/python3.6/site-packages/sklearn/externals/joblib/memory.py in __call__(self, *args, **kwargs)
    320     
    321     def __call__(self, *args, **kwargs):
--> 322         return self.func(*args, **kwargs)
    323     
    324     def call_and_shelve(self, *args, **kwargs):
/opt/conda/lib/python3.6/site-packages/sklearn/pipeline.py in _fit_transform_one(transformer, X, y, weight, **fit_params)
    612 def _fit_transform_one(transformer, X, y, weight, **fit_params):
    613     if hasattr(transformer, 'fit_transform'):
--> 614         res = transformer.fit_transform(X, y, **fit_params)
    615     else:
    616         res = transformer.fit(X, y, **fit_params).transform(X)
/opt/conda/lib/python3.6/site-packages/sklearn/pipeline.py in fit_transform(self, X, y, **fit_params)
    790             delayed(_fit_transform_one)(trans, X, y, weight,
    791                                         **fit_params)
--> 792             for name, trans, weight in self._iter())
    793     
    794         if not result:
/opt/conda/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in __call__(self, iterable)
    979             # remaining jobs.
    980             self._iterating = False
--> 981             if self.dispatch_one_batch(iterator):
    982                 self._iterating = self._original_iterator is not None
    983 
/opt/conda/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in dispatch_one_batch(self, iterator)
    821                 return False
    822             else:
--> 823                 self._dispatch(tasks)
    824                 return True
    825 
/opt/conda/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in _dispatch(self, batch)
    778         with self._lock:
    779             job_idx = len(self._jobs)
--> 780             job = self._backend.apply_async(batch, callback=cb)
    781             # A job can complete so quickly than its callback is
    782             # called before we get here, causing self._jobs to
/opt/conda/lib/python3.6/site-packages/sklearn/externals/joblib/_parallel_backends.py in apply_async(self, func, callback)
    181     def apply_async(self, func, callback=None):
    182         """Schedule a func to be run"""
--> 183         result = ImmediateResult(func)
    184         if callback:
    185             callback(result)
/opt/conda/lib/python3.6/site-packages/sklearn/externals/joblib/_parallel_backends.py in __init__(self, batch)
    541         # Don't delay the application, to avoid keeping the input
    542         # arguments in memory
--> 543         self.results = batch()
    544     
    545     def get(self):
/opt/conda/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in __call__(self)
    259         with parallel_backend(self._backend):
    260             return [func(*args, **kwargs)
--> 261                     for func, args, kwargs in self.items]
    262     
    263     def __len__(self):
/opt/conda/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in <listcomp>(.0)
    259         with parallel_backend(self._backend):
    260             return [func(*args, **kwargs)
--> 261                     for func, args, kwargs in self.items]
    262     
    263     def __len__(self):
/opt/conda/lib/python3.6/site-packages/sklearn/pipeline.py in _fit_transform_one(transformer, X, y, weight, **fit_params)
    612 def _fit_transform_one(transformer, X, y, weight, **fit_params):
    613     if hasattr(transformer, 'fit_transform'):
--> 614         res = transformer.fit_transform(X, y, **fit_params)
    615     else:
    616         res = transformer.fit(X, y, **fit_params).transform(X)
/opt/conda/lib/python3.6/site-packages/sklearn/base.py in fit_transform(self, X, y, **fit_params)
    460         else:
    461             # fit method of arity 2 (supervised transformation)
--> 462             return self.fit(X, y, **fit_params).transform(X)
    463     
    464 
/opt/conda/lib/python3.6/site-packages/sklearn_pandas/dataframe_mapper.py in transform(self, X)
    342                 stacked,
    343                 columns=self.transformed_names_,
--> 344                 index=index)
    345             # preserve types
    346             for col, dtype in zip(self.transformed_names_, dtypes):
/opt/conda/lib/python3.6/site-packages/pandas/core/frame.py in __init__(self, data, index, columns, dtype, copy)
    377             else:
    378                 mgr = self._init_ndarray(data, index, columns, dtype=dtype,
--> 379                                          copy=copy)
    380         elif isinstance(data, (list, types.GeneratorType)):
    381             if isinstance(data, types.GeneratorType):
/opt/conda/lib/python3.6/site-packages/pandas/core/frame.py in _init_ndarray(self, values, index, columns, dtype, copy)
    525                     raise_with_traceback(e)
    526 
--> 527         index, columns = _get_axes(*values.shape)
    528         values = values.T
    529 
/opt/conda/lib/python3.6/site-packages/pandas/core/frame.py in _get_axes(N, K, index, columns)
    482                 index = com._default_index(N)
    483             else:
--> 484                 index = _ensure_index(index)
    485     
    486             if columns is None:
/opt/conda/lib/python3.6/site-packages/pandas/core/indexes/base.py in _ensure_index(index_like, copy)
4972             index_like = copy(index_like)
4973 
-> 4974     return Index(index_like)
4975 
4976 
/opt/conda/lib/python3.6/site-packages/pandas/core/indexes/base.py in __new__(cls, data, dtype, copy, name, fastpath, tupleize_cols, **kwargs)
    449                         data, names=name or kwargs.get('names'))
    450             # other iterable of some kind
--> 451             subarr = com._asarray_tuplesafe(data, dtype=object)
    452             return Index(subarr, dtype=dtype, copy=copy, name=name, **kwargs)
    453 
/opt/conda/lib/python3.6/site-packages/pandas/core/common.py in _asarray_tuplesafe(values, dtype)
    303     
    304     if not (isinstance(values, (list, tuple)) or hasattr(values, '__array__')):
--> 305         values = list(values)
    306     elif isinstance(values, Index):
    307         return values.values
TypeError: 'builtin_function_or_method' object is not iterable

Answer:

This looks like a bug in sklearn_pandas.cross_val_score.

sklearn_pandas wraps the DataFrame you supply in a DataWrapper object, as the source code shows:

def cross_val_score(model, X, *args, **kwargs):
    warnings.warn(DEPRECATION_MSG, DeprecationWarning)
    X = DataWrapper(X)
    return sk_cross_val_score(model, X, *args, **kwargs)

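This was apparently done to work around the old sklearn.cross_validation.cross_val_score, which did not handle pandas DataFrames well. When the data is split into training and test sets, DataWrapper returns a list instance.

But that list is not handled correctly during DataFrameMapper's transform(), as the source code shows: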

if self.df_out:
    # if no rows were dropped preserve the original index,
    # otherwise use a new integer one
    no_rows_dropped = len(X) == len(stacked)
    if no_rows_dropped:
        index = X.index      # <== this is where the error comes from
    else:
        index = None

Here X is not a DataFrame but a list object, so X.index is not a pandas index; it is the list's built-in index() method, which is why you get this error.
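
The failure can be reproduced without pandas or scikit-learn. The following is a minimal sketch, assuming only that X has ended up as a plain Python list; the variable names and values are illustrative and do not come from the question's code. It mirrors the `values = list(values)` call at the bottom of the traceback.

```python
# X stands in for the list that reaches DataFrameMapper.transform().
X = [[1.0, "Yes"], [2.0, "No"]]

# A list has no pandas index; X.index is the bound built-in method list.index.
index = X.index
index_type = type(index).__name__

# pandas ends up calling list(...) on whatever it was given as the index.
try:
    list(index)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(index_type)
print(error_message)
```

The printed type name is the same one that appears in the TypeError from the question.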

But since the newer sklearn cross_val_score handles DataFrames correctly, you do not need the sklearn_pandas import at all.

Change it from:

from sklearn_pandas import cross_val_score

to:

from sklearn.model_selection import cross_val_score

and you will no longer get that error.
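
As a rough illustration of that change, the sketch below passes a DataFrame straight to scikit-learn's own cross_val_score. The data is synthetic, and LogisticRegression is only a stand-in estimator chosen to keep the example small; neither is taken from the question.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the real data set (all values are made up).
rng = np.random.RandomState(0)
X = pd.DataFrame({
    "tenure": rng.randint(1, 72, size=60),
    "MonthlyCharges": rng.uniform(20.0, 120.0, size=60),
})
y = pd.Series(np.tile([0, 1], 30), name="Churn")

# scikit-learn slices the DataFrame itself for each fold; no wrapper is needed.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         scoring="roc_auc", cv=3)
print(scores.shape)
```

Each fold receives a real DataFrame slice, so anything downstream that reads X.index gets a pandas index rather than a list method.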

However, further along in your code you will hit another error:

AttributeError: 'numpy.ndarray' object has no attribute 'to_dict'

This happens because you wrap the two DataFrameMapper objects in a FeatureUnion, like this:

num_cat_union = FeatureUnion([("num_mapper", num_transf_mapper),
                            ("cat_mapper", cat_transf_mapper)])

and then do this:

pipeline = Pipeline([("featureunion", num_cat_union),
                    ("dictifier", Dictifier()),
                    ("vectorizer", DictVectorizer(sort=False)),
                    ("clf", xgb.XGBClassifier(max_depth=3))])

Your Dictifier expects to be passed a DataFrame so that it can call to_dict() on it, but the previous pipeline step, FeatureUnion, does not preserve the DataFrame; it converts it to a numpy array.
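
This behaviour can be checked in isolation. The sketch below assumes scikit-learn's default output configuration and uses two FunctionTransformer column selectors as stand-ins for the DataFrameMapper objects; the helper functions and sample values are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer


def select_numeric(frame):
    # Hypothetical helper: returns a one-column DataFrame.
    return frame[["tenure"]]


def select_categorical(frame):
    # Hypothetical helper: returns a one-column DataFrame.
    return frame[["Contract"]]


df = pd.DataFrame({"tenure": [1.0, 24.0, 60.0],
                   "Contract": ["One year", "Two year", "One year"]})

# Each branch on its own still hands back a DataFrame ...
part = FunctionTransformer(select_numeric).fit_transform(df)

# ... but FeatureUnion stacks the branch outputs into a plain numpy array.
union = FeatureUnion([("num", FunctionTransformer(select_numeric)),
                      ("cat", FunctionTransformer(select_categorical))])
out = union.fit_transform(df)

print(type(part).__name__, type(out).__name__, hasattr(out, "to_dict"))
```

Any step placed after the union therefore receives an array with no column names and no to_dict() method.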

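In general, DataFrameMapper and FeatureUnion do not work well together. I suggest removing the FeatureUnion entirely and combining your two DataFrameMapper objects into a single one. That effectively achieves what you wanted the FeatureUnion to do.

Something like this: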

transformers = []
# Simply combine your two sets of operations here
transformers.extend([([num_col], [Imputer(strategy="median"),
                                   StandardScaler()]) for num_col in num_cols])
transformers.extend([(cat_col, [CategoricalImputer()]) for cat_col in cat_cols])
num_cat_union = DataFrameMapper(transformers,
                                input_df=True,
                                df_out=True)
# The rest of your code ...
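
Once the mapper emits a DataFrame (df_out=True), the remaining steps can consume it directly. The sketch below shows only that hand-off. The Dictifier class here is a hypothetical reconstruction, since the question's own implementation is not shown, and the small DataFrame stands in for the mapper's output.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import Pipeline


class Dictifier(BaseEstimator, TransformerMixin):
    """Hypothetical reconstruction: turn a DataFrame into a list of row dicts."""

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        # Requires a DataFrame; a numpy array would fail here.
        return X.to_dict("records")


# Stand-in for the DataFrame produced by the combined mapper (values made up).
mapped = pd.DataFrame({"tenure": [-1.2, 0.0, 1.2],
                       "Contract": ["One year", "Two year", "One year"]})

encode = Pipeline([("dictifier", Dictifier()),
                   ("vectorizer", DictVectorizer(sort=False))])
encoded = encode.fit_transform(mapped)

feature_names = encode.named_steps["vectorizer"].feature_names_
print(encoded.shape)
print(feature_names)
```

DictVectorizer leaves the numeric column as a single feature and expands the string column into one indicator feature per category.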