### Pipeline error (ValueError: Specifying the columns using strings is only supported for pandas DataFrames)

The example is fully reproducible. Here is the complete notebook (it also downloads the data): https://github.com/ageron/handson-ml2/blob/master/02_end_to_end_machine_learning_project.ipynb

After this part of the notebook:

    full_pipeline_with_predictor = Pipeline([
        ("preparation", full_pipeline),
        ("linear", LinearRegression())
    ])
    full_pipeline_with_predictor.fit(housing, housing_labels)
    full_pipeline_with_predictor.predict(some_data)

I tried to get predictions on the test set with the following code:

    X_test_prepared = full_pipeline.transform(X_test)
    final_predictions = full_pipeline_with_predictor.predict(X_test_prepared)

But I got the following error:

    C:\Users\Alex\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\compose\_column_transformer.py:430: FutureWarning: Given feature/column names or counts do not match the ones for the data given during fit. This will fail from v0.24.
      FutureWarning)
    ---------------------------------------------------------------------------
    Empty                                     Traceback (most recent call last)
    ~\AppData\Local\Continuum\anaconda3\lib\site-packages\joblib\parallel.py in dispatch_one_batch(self, iterator)
        796             try:
    --> 797                 tasks = self._ready_batches.get(block=False)
        798             except queue.Empty:

    ~\AppData\Local\Continuum\anaconda3\lib\queue.py in get(self, block, timeout)
        166                 if not self._qsize():
    --> 167                     raise Empty
        168             elif timeout is None:

    Empty:

    During handling of the above exception, another exception occurred:

    ValueError                                Traceback (most recent call last)
    <ipython-input-141-dc87b1c9e658> in <module>
          5
          6 X_test_prepared = full_pipeline.transform(X_test)
    ----> 7 final_predictions = full_pipeline_with_predictor.predict(X_test_prepared)
          8
          9 final_mse = mean_squared_error(y_test, final_predictions)

    ~\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\utils\metaestimators.py in <lambda>(*args, **kwargs)
        114
        115         # lambda, but not partial, allows help() to work with update_wrapper
    --> 116         out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)
        117         # update the docstring of the returned function
        118         update_wrapper(out, self.fn)

    ~\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\pipeline.py in predict(self, X, **predict_params)
        417         Xt = X
        418         for _, name, transform in self._iter(with_final=False):
    --> 419             Xt = transform.transform(Xt)
        420         return self.steps[-1][-1].predict(Xt, **predict_params)
        421

    ~\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\compose\_column_transformer.py in transform(self, X)
        586
        587         self._validate_features(X.shape[1], X_feature_names)
    --> 588         Xs = self._fit_transform(X, None, _transform_one, fitted=True)
        589         self._validate_output(Xs)
        590

    ~\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\compose\_column_transformer.py in _fit_transform(self, X, y, func, fitted)
        455                     message=self._log_message(name, idx, len(transformers)))
        456                 for idx, (name, trans, column, weight) in enumerate(
    --> 457                         self._iter(fitted=fitted, replace_strings=True), 1))
        458         except ValueError as e:
        459             if "Expected 2D array, got 1D array instead" in str(e):

    ~\AppData\Local\Continuum\anaconda3\lib\site-packages\joblib\parallel.py in __call__(self, iterable)
       1002             # remaining jobs.
       1003             self._iterating = False
    -> 1004             if self.dispatch_one_batch(iterator):
       1005                 self._iterating = self._original_iterator is not None
       1006

    ~\AppData\Local\Continuum\anaconda3\lib\site-packages\joblib\parallel.py in dispatch_one_batch(self, iterator)
        806                 big_batch_size = batch_size * n_jobs
        807
    --> 808                 islice = list(itertools.islice(iterator, big_batch_size))
        809                 if len(islice) == 0:
        810                     return False

    ~\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\compose\_column_transformer.py in <genexpr>(.0)
        454                     message_clsname='ColumnTransformer',
        455                     message=self._log_message(name, idx, len(transformers)))
    --> 456                 for idx, (name, trans, column, weight) in enumerate(
        457                         self._iter(fitted=fitted, replace_strings=True), 1))
        458         except ValueError as e:

    ~\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\utils\__init__.py in _safe_indexing(X, indices, axis)
        404     if axis == 1 and indices_dtype == 'str' and not hasattr(X, 'loc'):
        405         raise ValueError(
    --> 406             "Specifying the columns using strings is only supported for "
        407             "pandas DataFrames"
        408         )

    ValueError: Specifying the columns using strings is only supported for pandas DataFrames

Question: How do I fix this error, and why does it occur?


Answer:

Since your final pipeline:

    full_pipeline_with_predictor = Pipeline([
        ("preparation", full_pipeline),
        ("linear", LinearRegression())
    ])

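clearly already contains full_pipeline, you should not "prepare" your X_test again; by doing so you apply the preparation to X_test twice, which is wrong. It also explains the error: full_pipeline.transform returns a plain array rather than a pandas DataFrame, so when the pipeline's own preparation step then tries to select columns by name, it fails with the ValueError shown in the traceback. Your code should therefore simply be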

    final_predictions = full_pipeline_with_predictor.predict(X_test)

exactly as you did when getting predictions for some_data, i.e.

    full_pipeline_with_predictor.predict(some_data)

where you correctly did not "prepare" some_data before feeding it to the final pipeline.
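To make the failure mode concrete, here is a minimal sketch that reproduces the double-preparation mistake on a small stand-in dataset. Everything named `toy_*`, along with the column names and values, is hypothetical and is not taken from the notebook; the exact error message may also differ between scikit-learn versions, so the sketch only assumes that a `ValueError` is raised.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data standing in for the housing DataFrame.
toy = pd.DataFrame({
    "rooms": [3.0, 4.0, 2.0, 5.0, 3.0, 6.0],
    "income": [2.5, 3.1, 1.8, 4.2, 2.9, 5.0],
    "proximity": ["INLAND", "NEAR BAY", "INLAND",
                  "NEAR BAY", "INLAND", "NEAR BAY"],
})
toy_labels = np.array([150.0, 210.0, 120.0, 300.0, 160.0, 350.0])

# The preparation step selects its columns by *name* (strings).
toy_preparation = ColumnTransformer([
    ("num", StandardScaler(), ["rooms", "income"]),
    ("cat", OneHotEncoder(), ["proximity"]),
])
toy_model = Pipeline([
    ("preparation", toy_preparation),
    ("linear", LinearRegression()),
])
toy_model.fit(toy, toy_labels)

# Correct: pass the raw DataFrame; the pipeline prepares it internally.
raw_predictions = toy_model.predict(toy)

# The prepared output is no longer a DataFrame, so it has no named columns.
prepared = toy_model.named_steps["preparation"].transform(toy)
print(type(prepared).__name__, "has .loc:", hasattr(prepared, "loc"))

# Mistake: feeding already-prepared data back into the full pipeline.
try:
    toy_model.predict(prepared)
    double_prep_error = None
except ValueError as exc:
    double_prep_error = exc
print("double preparation failed with:", double_prep_error)
```

The second `predict` call fails because the preparation step inside the pipeline receives an array without column names, while it was configured to look columns up by name.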

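This is the whole point of using pipelines: instead of running fit and predict separately for what may be several preparation steps, you wrap all of the steps in a single pipeline. You applied this correctly when predicting some_data, but seem to have forgotten it in the next step, when trying to predict X_test.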
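As an illustration of what the single pipeline does for you, the sketch below checks that one `predict` call on raw data gives the same result as running the preparation step and the regressor by hand. The synthetic data and all `demo_*` names are assumptions made up for this example; only the general pipeline structure mirrors the one in the question.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical synthetic data (not the housing dataset).
rng = np.random.default_rng(42)
n = 40
demo = pd.DataFrame({
    "rooms": rng.uniform(1, 8, n),
    "income": rng.uniform(1, 10, n),
    "proximity": rng.choice(["INLAND", "NEAR BAY"], n),
})
demo_labels = 30 * demo["rooms"] + 20 * demo["income"] + rng.normal(0, 5, n)

# Simple positional split into training and test rows.
X_train_demo, X_test_demo = demo.iloc[:30], demo.iloc[30:]
y_train_demo, y_test_demo = demo_labels.iloc[:30], demo_labels.iloc[30:]

demo_pipeline = Pipeline([
    ("preparation", ColumnTransformer([
        ("num", StandardScaler(), ["rooms", "income"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["proximity"]),
    ])),
    ("linear", LinearRegression()),
])
demo_pipeline.fit(X_train_demo, y_train_demo)

# One call on the raw test DataFrame: preparation happens inside the pipeline.
pipeline_predictions = demo_pipeline.predict(X_test_demo)

# The equivalent manual route: prepare once, then call the final estimator
# directly (not the whole pipeline, which would prepare a second time).
prepared_test = demo_pipeline.named_steps["preparation"].transform(X_test_demo)
manual_predictions = demo_pipeline.named_steps["linear"].predict(prepared_test)

demo_rmse = np.sqrt(mean_squared_error(y_test_demo, pipeline_predictions))
print("same predictions:", np.allclose(pipeline_predictions, manual_predictions))
print("test RMSE:", demo_rmse)
```

If you do want to keep a separately prepared array around, the manual route shows the consistent way to use it: pass it to the final estimator only, never back into the full pipeline.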
