I was reading this official sklearn tutorial to learn how to create a pipeline for text data analysis and then use it for a grid search. However, I ran into a problem: the method it provides does not work in this case.
I would like to make this code work:
```python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from mlxtend.feature_selection import ColumnSelector
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import BernoulliNB
from sklearn.feature_extraction.text import TfidfVectorizer

df_Xtrain = pd.DataFrame({'tweet': ['This is a tweet']*10,
                          'label': 0})
y_train = df_Xtrain['label'].to_numpy().ravel()

pipe = Pipeline([
    ('col_selector', ColumnSelector(cols=('tweet'))),
    ('tfidf', TfidfTransformer()),
    ('bernoulli', BernoulliNB()),
])

pipe.fit(df_Xtrain, y_train)
```
This code works:
```python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from mlxtend.feature_selection import ColumnSelector
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import BernoulliNB
from sklearn.feature_extraction.text import TfidfVectorizer

# data
df_Xtrain = pd.DataFrame({'tweet': ['This is a tweet']*10,
                          'label': 0})
y_train = df_Xtrain['label'].to_numpy().ravel()

# modelling
mc = 'tweet'
vec_tfidf = TfidfVectorizer()
vec_tfidf.fit(df_Xtrain[mc])
X_train = vec_tfidf.transform(df_Xtrain[mc]).toarray()

model = BernoulliNB()
model.fit(X_train, y_train)
model.predict(X_train)
model.score(X_train, y_train)
```
Question
How can I create a pipeline for text analysis like the one above?
Update
Versions
```
[('numpy', '1.17.5'), ('pandas', '1.0.5'), ('sklearn', '0.23.1'), ('mlxtend', '0.17.0')]
Python 3.7.7
```
Error log
```
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-1-3012ce7245d9> in <module>
     19
     20
---> 21 pipe.fit(df_Xtrain,y_train)

~/opt/miniconda3/envs/spk/lib/python3.7/site-packages/sklearn/pipeline.py in fit(self, X, y, **fit_params)
    328         """
    329         fit_params_steps = self._check_fit_params(**fit_params)
--> 330         Xt = self._fit(X, y, **fit_params_steps)
    331         with _print_elapsed_time('Pipeline',
    332                                  self._log_message(len(self.steps) - 1)):

~/opt/miniconda3/envs/spk/lib/python3.7/site-packages/sklearn/pipeline.py in _fit(self, X, y, **fit_params_steps)
    294                 message_clsname='Pipeline',
    295                 message=self._log_message(step_idx),
--> 296                 **fit_params_steps[name])
    297             # Replace the transformer of the step with the fitted
    298             # transformer. This is necessary when loading the transformer

~/opt/miniconda3/envs/spk/lib/python3.7/site-packages/joblib/memory.py in __call__(self, *args, **kwargs)
    350
    351     def __call__(self, *args, **kwargs):
--> 352         return self.func(*args, **kwargs)
    353
    354     def call_and_shelve(self, *args, **kwargs):

~/opt/miniconda3/envs/spk/lib/python3.7/site-packages/sklearn/pipeline.py in _fit_transform_one(transformer, X, y, weight, message_clsname, message, **fit_params)
    738     with _print_elapsed_time(message_clsname, message):
    739         if hasattr(transformer, 'fit_transform'):
--> 740             res = transformer.fit_transform(X, y, **fit_params)
    741         else:
    742             res = transformer.fit(X, y, **fit_params).transform(X)

~/opt/miniconda3/envs/spk/lib/python3.7/site-packages/sklearn/base.py in fit_transform(self, X, y, **fit_params)
    691         else:
    692             # fit method of arity 2 (supervised transformation)
--> 693             return self.fit(X, y, **fit_params).transform(X)
    694
    695

~/opt/miniconda3/envs/spk/lib/python3.7/site-packages/sklearn/feature_extraction/text.py in fit(self, X, y)
   1429             A matrix of term/token counts.
   1430         """
-> 1431         X = check_array(X, accept_sparse=('csr', 'csc'))
   1432         if not sp.issparse(X):
   1433             X = sp.csr_matrix(X)

~/opt/miniconda3/envs/spk/lib/python3.7/site-packages/sklearn/utils/validation.py in inner_f(*args, **kwargs)
     71                           FutureWarning)
     72         kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 73         return f(**kwargs)
     74     return inner_f
     75

~/opt/miniconda3/envs/spk/lib/python3.7/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
    597                     array = array.astype(dtype, casting="unsafe", copy=False)
    598                 else:
--> 599                     array = np.asarray(array, order=order, dtype=dtype)
    600             except ComplexWarning:
    601                 raise ValueError("Complex data not supported\n"

~/opt/miniconda3/envs/spk/lib/python3.7/site-packages/numpy/core/_asarray.py in asarray(a, dtype, order)
     83
     84     """
---> 85     return array(a, dtype, copy=False, order=order)
     86
     87

ValueError: could not convert string to float: 'This is a tweet'
```
Answer:
There are two main problems with your code –
- You are using `TfidfTransformer` without a `CountVectorizer` before it. Instead, just use `TfidfVectorizer`, which does both steps in one go.
- Your `ColumnSelector` returns a 2-D array `(n, 1)`, while `TfidfVectorizer` expects a 1-D array `(n,)`. This can be fixed by setting the parameter `drop_axis=True`.
With the above changes, this should work –
```python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from mlxtend.feature_selection import ColumnSelector
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB

df_Xtrain = pd.DataFrame({'tweet': ['This is a tweet']*10,
                          'label': 0})
y_train = df_Xtrain['label'].to_numpy().ravel()

pipe = Pipeline([
    ('col_selector', ColumnSelector(cols=('tweet'), drop_axis=True)),
    ('tfidf', TfidfVectorizer()),
    ('bernoulli', BernoulliNB()),
])

pipe.fit(df_Xtrain, y_train)
```
```
Pipeline(steps=[('col_selector', ColumnSelector(cols='tweet', drop_axis=True)),
                ('tfidf', TfidfVectorizer()),
                ('bernoulli', BernoulliNB())])
```
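Once fitted, the pipeline behaves like any other estimator and can go straight into a grid search, which is what the question is ultimately after. Below is a minimal sketch on the same toy data; the parameter grid is purely illustrative, not a recommendation:

```python
from sklearn.model_selection import GridSearchCV

# The pipeline accepts the raw DataFrame; the col_selector step
# extracts the 'tweet' column internally before vectorizing.
print(pipe.predict(df_Xtrain))         # predicted labels
print(pipe.score(df_Xtrain, y_train))  # mean accuracy

# Nested parameters are addressed as '<step name>__<parameter>'.
param_grid = {
    'tfidf__ngram_range': [(1, 1), (1, 2)],
    'bernoulli__alpha': [0.1, 1.0],
}
grid = GridSearchCV(pipe, param_grid, cv=2)
grid.fit(df_Xtrain, y_train)
print(grid.best_params_)
```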
Edit: In response to the follow-up question – "Is it possible to do this without the mlxtend package? Why do I need ColumnSelector here? Is there an sklearn-only solution?"
Yes, as shown below, you need to build your own column selector class (this is also how you build your own transformers and add them to a pipeline).
```python
class SelectColumnsTransformer():
    def __init__(self, columns=None):
        self.columns = columns

    def transform(self, X, **transform_params):
        cpy_df = X[self.columns].copy()
        return cpy_df

    def fit(self, X, y=None, **fit_params):
        return self

# add it to the pipeline
pipe = Pipeline([
    ('selector', SelectColumnsTransformer([<enter your column names here>]))
])
```
For more information on how to do this, refer to this link.
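For an sklearn-only pipeline, a minimal end-to-end sketch built on this custom transformer could look as follows. Note the assumption that a single column name is passed as a plain string, so `transform` returns a 1-D Series, which is the shape `TfidfVectorizer` expects:

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB

class SelectColumnsTransformer():
    def __init__(self, columns=None):
        self.columns = columns

    def transform(self, X, **transform_params):
        # Indexing with a single column name (a string) yields a 1-D Series
        return X[self.columns].copy()

    def fit(self, X, y=None, **fit_params):
        return self

df_Xtrain = pd.DataFrame({'tweet': ['This is a tweet']*10, 'label': 0})
y_train = df_Xtrain['label'].to_numpy().ravel()

pipe = Pipeline([
    ('selector', SelectColumnsTransformer('tweet')),  # a string, not a list
    ('tfidf', TfidfVectorizer()),
    ('bernoulli', BernoulliNB()),
])

pipe.fit(df_Xtrain, y_train)
```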