ValueError when using a custom spaCy class with ColumnTransformer in a Sklearn Pipeline – using GloveVectorizer

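I have a dataset with several text columns and one target column. I am trying to process the text columns with GloVe embeddings through a custom spaCy-based class, and to do this inside a Pipeline. However, I get a ValueError. Here is my code: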

data_features = df.copy()[["title", "description"]]
train_data, test_data, train_target, test_target = train_test_split(data_features, df['target'], test_size = 0.1)

I created this custom class to use the GloVe embeddings. I took the code from this tutorial.

class SpacyVectorTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, nlp):
        self.nlp = nlp
        self.dim = 300

    def fit(self, X, y):
        return self

    def transform(self, X):
        return [self.nlp(text).vector for text in X]

Loading the nlp model:

nlp = spacy.load("en_core_web_sm")

This is the column transformer I am trying to use in the pipeline:

col_preprocessor = ColumnTransformer(
    [
        ('title_glove', SpacyVectorTransformer(nlp), 'title'),
        ('description_glove', SpacyVectorTransformer(nlp), 'description'),
    ],
    remainder='drop',
    n_jobs=1
)

This is my pipeline:

pipeline_glove = Pipeline([
    ('col_preprocessor', col_preprocessor),
    ('classifier', LogisticRegression())
])

When I run the fit method, I get the following error:

pipeline_glove.fit(train_data, train_target)

Error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-219-8543ea744205> in <module>
----> 1 pipeline_glove.fit(train_data, train_target)

/opt/conda/lib/python3.7/site-packages/sklearn/pipeline.py in fit(self, X, y, **fit_params)
    328         """
    329         fit_params_steps = self._check_fit_params(**fit_params)
--> 330         Xt = self._fit(X, y, **fit_params_steps)
    331         with _print_elapsed_time('Pipeline',
    332                                  self._log_message(len(self.steps) - 1)):

/opt/conda/lib/python3.7/site-packages/sklearn/pipeline.py in _fit(self, X, y, **fit_params_steps)
    294                 message_clsname='Pipeline',
    295                 message=self._log_message(step_idx),
--> 296                 **fit_params_steps[name])
    297             # Replace the transformer of the step with the fitted
    298             # transformer. This is necessary when loading the transformer

/opt/conda/lib/python3.7/site-packages/joblib/memory.py in __call__(self, *args, **kwargs)
    353
    354     def __call__(self, *args, **kwargs):
--> 355         return self.func(*args, **kwargs)
    356
    357     def call_and_shelve(self, *args, **kwargs):

/opt/conda/lib/python3.7/site-packages/sklearn/pipeline.py in _fit_transform_one(transformer, X, y, weight, message_clsname, message, **fit_params)
    738     with _print_elapsed_time(message_clsname, message):
    739         if hasattr(transformer, 'fit_transform'):
--> 740             res = transformer.fit_transform(X, y, **fit_params)
    741         else:
    742             res = transformer.fit(X, y, **fit_params).transform(X)

/opt/conda/lib/python3.7/site-packages/sklearn/compose/_column_transformer.py in fit_transform(self, X, y)
    549
    550         self._update_fitted_transformers(transformers)
--> 551         self._validate_output(Xs)
    552
    553         return self._hstack(list(Xs))

/opt/conda/lib/python3.7/site-packages/sklearn/compose/_column_transformer.py in _validate_output(self, result)
    410                 raise ValueError(
    411                     "The output of the '{0}' transformer should be 2D (scipy "
--> 412                     "matrix, array, or pandas DataFrame).".format(name))
    413
    414     def _validate_features(self, n_features, feature_names):

ValueError: The output of the 'title_glove' transformer should be 2D (scipy matrix, array, or pandas DataFrame).

Answer:

The error message tells you what needs to be fixed.

ValueError: The output of the 'title_glove' transformer should be 2D (scipy matrix, array, or pandas DataFrame).

However, your current transformer (SpacyVectorTransformer) returns a list. You can fix this by converting it to a pandas DataFrame, for example like this:

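import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class SpacyVectorTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, nlp):
        self.nlp = nlp
        self.dim = 300

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Wrap the per-document vectors in a DataFrame so the output is 2D
        return pd.DataFrame([self.nlp(text).vector for text in X])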
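If you would rather not depend on pandas inside the transformer, stacking the vectors into a 2D NumPy array should work just as well, since the error message lists an array as an accepted output. The sketch below only illustrates the shape difference. `FakeDoc` and `fake_nlp` are hypothetical stand-ins for a spaCy pipeline (they are not part of spaCy), and the 4-dimensional toy vectors are an assumption made so the snippet runs without downloading a model.

```python
import numpy as np


class FakeDoc:
    """Hypothetical stand-in for a spaCy Doc; it only exposes `.vector`."""

    def __init__(self, text, dim=4):
        # Toy vector: the text length repeated `dim` times
        self.vector = np.full(dim, float(len(text)))


def fake_nlp(text):
    """Hypothetical stand-in for a loaded spaCy pipeline."""
    return FakeDoc(text)


texts = ["first title", "second", "third one"]

# What the original transform() returns: a plain list of 1D vectors
as_list = [fake_nlp(t).vector for t in texts]

# Stacking the vectors row-wise gives one 2D array of shape (n_samples, dim)
as_array = np.vstack(as_list)

print(hasattr(as_list, "ndim"))  # False: a list has no array dimensions
print(as_array.shape)            # (3, 4)
```

In the real transformer, this would amount to returning `np.vstack([self.nlp(text).vector for text in X])` from `transform`.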

Next time, please provide a minimal reproducible example. The code you posted has no imports and no DataFrame named "df".
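For illustration, a minimal reproducible version of the question with the fix applied might look like the sketch below. Everything not in the original post is an assumption: the four-row `df` is made-up toy data, and `FakeDoc` / `fake_nlp` are hypothetical stand-ins that replace the spaCy model with small random vectors so the snippet runs without any model download. With a real model you would pass the loaded `nlp` object instead.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


class FakeDoc:
    """Hypothetical stand-in for a spaCy Doc; it only exposes `.vector`."""

    def __init__(self, text, dim=4):
        # Deterministic toy vector seeded by the text length
        rng = np.random.default_rng(len(text))
        self.vector = rng.normal(size=dim)


def fake_nlp(text):
    """Hypothetical stand-in for a loaded spaCy pipeline."""
    return FakeDoc(text)


class SpacyVectorTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, nlp):
        self.nlp = nlp

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Return a 2D DataFrame: one row per text, one column per dimension
        return pd.DataFrame([self.nlp(text).vector for text in X])


# Made-up toy data standing in for the asker's DataFrame
df = pd.DataFrame({
    "title": ["cheap phone", "fast red car", "old book", "new laptop"],
    "description": ["works fine", "barely used", "a few torn pages", "still sealed"],
    "target": [0, 1, 0, 1],
})
X = df[["title", "description"]]
y = df["target"]

col_preprocessor = ColumnTransformer(
    [
        ("title_glove", SpacyVectorTransformer(fake_nlp), "title"),
        ("description_glove", SpacyVectorTransformer(fake_nlp), "description"),
    ],
    remainder="drop",
)

pipeline_glove = Pipeline([
    ("col_preprocessor", col_preprocessor),
    ("classifier", LogisticRegression()),
])

pipeline_glove.fit(X, y)

# The two 4-dimensional blocks are stacked side by side: 4 rows, 8 columns
features = pipeline_glove.named_steps["col_preprocessor"].transform(X)
print(features.shape)  # (4, 8)
```

Because each column is selected with a single string (`"title"`, `"description"`), the transformer receives a one-dimensional column of texts, which is what the list comprehension in `transform` iterates over.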
