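I have a dataset with several text columns and one target column. I am trying to process the text columns with GloVe embeddings through a custom class built on spaCy, and I want to do this inside a Pipeline. However, I get a ValueError. Here is my code: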
data_features = df.copy()[["title", "description"]]
train_data, test_data, train_target, test_target = train_test_split(data_features, df['target'], test_size = 0.1)
I created this custom class to use the GloVe embeddings. I took the code from this tutorial.
class SpacyVectorTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, nlp):
        self.nlp = nlp
        self.dim = 300

    def fit(self, X, y):
        return self

    def transform(self, X):
        return [self.nlp(text).vector for text in X]
Loading the nlp model:
nlp = spacy.load("en_core_web_sm")
This is the column transformer I am trying to use in the pipeline:
col_preprocessor = ColumnTransformer(
    [
        ('title_glove', SpacyVectorTransformer(nlp), 'title'),
        ('description_glove', SpacyVectorTransformer(nlp), 'description'),
    ],
    remainder='drop',
    n_jobs=1
)
This is my pipeline:
pipeline_glove = Pipeline([
    ('col_preprocessor', col_preprocessor),
    ('classifier', LogisticRegression())
])
When I call the fit method, I get the following error:
pipeline_glove.fit(train_data, train_target)
Error:
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-219-8543ea744205> in <module>
----> 1 pipeline_glove.fit(train_data, train_target)

/opt/conda/lib/python3.7/site-packages/sklearn/pipeline.py in fit(self, X, y, **fit_params)
    328         """
    329         fit_params_steps = self._check_fit_params(**fit_params)
--> 330         Xt = self._fit(X, y, **fit_params_steps)
    331         with _print_elapsed_time('Pipeline',
    332                                  self._log_message(len(self.steps) - 1)):

/opt/conda/lib/python3.7/site-packages/sklearn/pipeline.py in _fit(self, X, y, **fit_params_steps)
    294                 message_clsname='Pipeline',
    295                 message=self._log_message(step_idx),
--> 296                 **fit_params_steps[name])
    297             # Replace the transformer of the step with the fitted
    298             # transformer. This is necessary when loading the transformer

/opt/conda/lib/python3.7/site-packages/joblib/memory.py in __call__(self, *args, **kwargs)
    353
    354     def __call__(self, *args, **kwargs):
--> 355         return self.func(*args, **kwargs)
    356
    357     def call_and_shelve(self, *args, **kwargs):

/opt/conda/lib/python3.7/site-packages/sklearn/pipeline.py in _fit_transform_one(transformer, X, y, weight, message_clsname, message, **fit_params)
    738     with _print_elapsed_time(message_clsname, message):
    739         if hasattr(transformer, 'fit_transform'):
--> 740             res = transformer.fit_transform(X, y, **fit_params)
    741         else:
    742             res = transformer.fit(X, y, **fit_params).transform(X)

/opt/conda/lib/python3.7/site-packages/sklearn/compose/_column_transformer.py in fit_transform(self, X, y)
    549
    550         self._update_fitted_transformers(transformers)
--> 551         self._validate_output(Xs)
    552
    553         return self._hstack(list(Xs))

/opt/conda/lib/python3.7/site-packages/sklearn/compose/_column_transformer.py in _validate_output(self, result)
    410                 raise ValueError(
    411                     "The output of the '{0}' transformer should be 2D (scipy "
--> 412                     "matrix, array, or pandas DataFrame).".format(name))
    413
    414     def _validate_features(self, n_features, feature_names):

ValueError: The output of the 'title_glove' transformer should be 2D (scipy matrix, array, or pandas DataFrame).
Answer:
The error message tells you what needs to be fixed:
ValueError: The output of the 'title_glove' transformer should be 2D (scipy matrix, array, or pandas DataFrame).
However, your current transformer (SpacyVectorTransformer) returns a plain Python list. You can fix this by converting the list to a pandas DataFrame, for example like this:
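import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class SpacyVectorTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, nlp):
        self.nlp = nlp
        self.dim = 300

    def fit(self, X, y=None):
        # Nothing to learn; y defaults to None so fit(X) also works.
        return self

    def transform(self, X):
        # One row per text, one column per vector component (2D output).
        return pd.DataFrame([self.nlp(text).vector for text in X])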
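As far as I can tell, the reason a list is rejected is that the validation step looks at the output's `ndim` attribute, which a Python list does not have, even when it holds equally sized vectors. Stacking the vectors into a NumPy array should work just as well as a DataFrame. The sketch below illustrates this with NumPy only; `FakeDoc` and `fake_nlp` are hypothetical stand-ins for a spaCy document and model, and the `getattr` check is my approximation of what scikit-learn does, not a copy of its code.

```python
import numpy as np


class FakeDoc:
    """Hypothetical stand-in for a spaCy Doc; it only exposes `.vector`."""

    def __init__(self, text, dim=4):
        # Toy vector: every component equals the text length.
        self.vector = np.full(dim, float(len(text)))


def fake_nlp(text):
    """Hypothetical stand-in for a loaded spaCy model."""
    return FakeDoc(text)


texts = ["first title", "second"]

# What the original transform() returns: a list of 1D arrays.
as_list = [fake_nlp(t).vector for t in texts]

# Stacking the same vectors row by row gives a genuine 2D array.
as_array = np.vstack(as_list)

# Approximation of the dimensionality check (an assumption, see above).
print(getattr(as_list, "ndim", 0))   # 0 -> a list has no ndim attribute
print(getattr(as_array, "ndim", 0))  # 2
print(as_array.shape)                # (2, 4)
```

In the transformer itself, that would mean returning `np.vstack([...])` from `transform` instead of the bare list.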
Next time, please provide a minimal reproducible example. The code you posted has no imports and no DataFrame named "df".
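For reference, a minimal reproducible version of the question could look roughly like the sketch below. Everything that is not in the original post is made up for illustration: the four-row `df`, and `ToyDoc` / `toy_nlp`, which replace the spaCy model so the example runs without downloading anything. The transformer already includes the DataFrame fix.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


class ToyDoc:
    """Hypothetical stand-in for a spaCy Doc; it only exposes `.vector`."""

    def __init__(self, text):
        # Two made-up features: character count and word count.
        self.vector = np.array([len(text), len(text.split())], dtype=float)


def toy_nlp(text):
    """Hypothetical stand-in for the model returned by spacy.load(...)."""
    return ToyDoc(text)


# Mixin listed first, following scikit-learn's guidance on inheritance order.
class SpacyVectorTransformer(TransformerMixin, BaseEstimator):
    def __init__(self, nlp):
        self.nlp = nlp

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # One row per text, one column per vector component (2D output).
        return pd.DataFrame([self.nlp(text).vector for text in X])


# Made-up data standing in for the real `df`.
df = pd.DataFrame({
    "title": ["cheap watches", "meeting notes",
              "win a prize now", "project update"],
    "description": ["buy now at low prices", "agenda for monday",
                    "click here to claim your prize", "status of the release"],
    "target": [1, 0, 1, 0],
})
X = df[["title", "description"]]

col_preprocessor = ColumnTransformer(
    [
        ("title_glove", SpacyVectorTransformer(toy_nlp), "title"),
        ("description_glove", SpacyVectorTransformer(toy_nlp), "description"),
    ],
    remainder="drop",
)

# Each column contributes 2 toy features, so 4 columns are expected.
features = col_preprocessor.fit_transform(X)
print(np.asarray(features).shape)

pipeline_glove = Pipeline([
    ("col_preprocessor", col_preprocessor),
    ("classifier", LogisticRegression()),
])
pipeline_glove.fit(X, df["target"])
predictions = pipeline_glove.predict(X)
print(predictions.shape)
```

With something like this, anyone can paste the code and reproduce the error (or confirm the fix) without needing your data.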