How to create a scikit-learn pipeline for a tf-idf vectorizer?

I was reading this official sklearn tutorial on how to create a pipeline for text data analysis and then use it in a grid search. But I ran into a problem: the approach shown there does not work in my case.

I would like this code to work:

import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from mlxtend.feature_selection import ColumnSelector
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import BernoulliNB
from sklearn.feature_extraction.text import TfidfVectorizer

df_Xtrain = pd.DataFrame({'tweet': ['This is a tweet']*10,
                          'label': 0})
y_train = df_Xtrain['label'].to_numpy().ravel()

pipe = Pipeline([
    ('col_selector', ColumnSelector(cols=('tweet'))),
    ('tfidf', TfidfTransformer()),
    ('bernoulli', BernoulliNB()),
])

pipe.fit(df_Xtrain, y_train)

This code works:

import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from mlxtend.feature_selection import ColumnSelector
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import BernoulliNB
from sklearn.feature_extraction.text import TfidfVectorizer

# data
df_Xtrain = pd.DataFrame({'tweet': ['This is a tweet']*10,
                          'label': 0})
y_train = df_Xtrain['label'].to_numpy().ravel()

# modelling
mc = 'tweet'
vec_tfidf = TfidfVectorizer()
vec_tfidf.fit(df_Xtrain[mc])
X_train = vec_tfidf.transform(df_Xtrain[mc]).toarray()

model = BernoulliNB()
model.fit(X_train, y_train)
model.predict(X_train)
model.score(X_train, y_train)

Question

How can I create a pipeline for text analysis like the one above?

Update

Versions

[('numpy', '1.17.5'), ('pandas', '1.0.5'), ('sklearn', '0.23.1'), ('mlxtend', '0.17.0')]
Python 3.7.7

Error log

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-1-3012ce7245d9> in <module>
     19
     20
---> 21 pipe.fit(df_Xtrain,y_train)

~/opt/miniconda3/envs/spk/lib/python3.7/site-packages/sklearn/pipeline.py in fit(self, X, y, **fit_params)
    328         """
    329         fit_params_steps = self._check_fit_params(**fit_params)
--> 330         Xt = self._fit(X, y, **fit_params_steps)
    331         with _print_elapsed_time('Pipeline',
    332                                  self._log_message(len(self.steps) - 1)):

~/opt/miniconda3/envs/spk/lib/python3.7/site-packages/sklearn/pipeline.py in _fit(self, X, y, **fit_params_steps)
    294                 message_clsname='Pipeline',
    295                 message=self._log_message(step_idx),
--> 296                 **fit_params_steps[name])
    297             # Replace the transformer of the step with the fitted
    298             # transformer. This is necessary when loading the transformer

~/opt/miniconda3/envs/spk/lib/python3.7/site-packages/joblib/memory.py in __call__(self, *args, **kwargs)
    350
    351     def __call__(self, *args, **kwargs):
--> 352         return self.func(*args, **kwargs)
    353
    354     def call_and_shelve(self, *args, **kwargs):

~/opt/miniconda3/envs/spk/lib/python3.7/site-packages/sklearn/pipeline.py in _fit_transform_one(transformer, X, y, weight, message_clsname, message, **fit_params)
    738     with _print_elapsed_time(message_clsname, message):
    739         if hasattr(transformer, 'fit_transform'):
--> 740             res = transformer.fit_transform(X, y, **fit_params)
    741         else:
    742             res = transformer.fit(X, y, **fit_params).transform(X)

~/opt/miniconda3/envs/spk/lib/python3.7/site-packages/sklearn/base.py in fit_transform(self, X, y, **fit_params)
    691         else:
    692             # fit method of arity 2 (supervised transformation)
--> 693             return self.fit(X, y, **fit_params).transform(X)
    694
    695

~/opt/miniconda3/envs/spk/lib/python3.7/site-packages/sklearn/feature_extraction/text.py in fit(self, X, y)
   1429             A matrix of term/token counts.
   1430         """
-> 1431         X = check_array(X, accept_sparse=('csr', 'csc'))
   1432         if not sp.issparse(X):
   1433             X = sp.csr_matrix(X)

~/opt/miniconda3/envs/spk/lib/python3.7/site-packages/sklearn/utils/validation.py in inner_f(*args, **kwargs)
     71                           FutureWarning)
     72         kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 73         return f(**kwargs)
     74     return inner_f
     75

~/opt/miniconda3/envs/spk/lib/python3.7/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
    597                     array = array.astype(dtype, casting="unsafe", copy=False)
    598                 else:
--> 599                     array = np.asarray(array, order=order, dtype=dtype)
    600             except ComplexWarning:
    601                 raise ValueError("Complex data not supported\n"

~/opt/miniconda3/envs/spk/lib/python3.7/site-packages/numpy/core/_asarray.py in asarray(a, dtype, order)
     83
     84     """
---> 85     return array(a, dtype, copy=False, order=order)
     86
     87

ValueError: could not convert string to float: 'This is a tweet'

Answer:

There are two main problems with your code –

  1. You are using TfidfTransformer without a CountVectorizer in front of it. Instead, just use TfidfVectorizer, which does both steps in one go.
  2. Your ColumnSelector returns a 2-D array of shape (n, 1), while TfidfVectorizer expects a 1-D array of shape (n,). This can be fixed by setting the parameter drop_axis=True.
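To make the first point concrete, here is a minimal sketch (using a tiny made-up document list) checking that TfidfVectorizer produces the same output as CountVectorizer followed by TfidfTransformer:

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)

docs = ['This is a tweet', 'Another short tweet']

# Two-step route: raw text -> token counts -> tf-idf weights
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# One-step route: raw text -> tf-idf weights directly
one_step = TfidfVectorizer().fit_transform(docs)

# Both routes yield the same tf-idf matrix
assert np.allclose(two_step.toarray(), one_step.toarray())
```

This is why TfidfTransformer alone fails in your pipeline: it expects a numeric count matrix, not raw strings.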

With the above changes, this should work –

import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from mlxtend.feature_selection import ColumnSelector
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB

df_Xtrain = pd.DataFrame({'tweet': ['This is a tweet']*10,
                          'label': 0})
y_train = df_Xtrain['label'].to_numpy().ravel()

pipe = Pipeline([
    ('col_selector', ColumnSelector(cols=('tweet'), drop_axis=True)),
    ('tfidf', TfidfVectorizer()),
    ('bernoulli', BernoulliNB()),
])

pipe.fit(df_Xtrain, y_train)
Pipeline(steps=[('col_selector', ColumnSelector(cols='tweet', drop_axis=True)),
                ('tfidf', TfidfVectorizer()), ('bernoulli', BernoulliNB())])
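Since the original tutorial you followed was about grid search, here is a hedged sketch of how a pipeline like this plugs into GridSearchCV. The toy data and the parameter grid values are illustrative assumptions, not the only sensible choices; the key point is that parameters are addressed as '<step name>__<parameter>':

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB

# Toy data with two classes so cross-validation has something to score
df = pd.DataFrame({'tweet': ['good tweet', 'bad tweet']*5,
                   'label': [0, 1]*5})

pipe = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('bernoulli', BernoulliNB()),
])

# Step names prefix the parameter names, e.g. 'tfidf__ngram_range'
param_grid = {
    'tfidf__ngram_range': [(1, 1), (1, 2)],
    'bernoulli__alpha': [0.5, 1.0],
}
grid = GridSearchCV(pipe, param_grid, cv=2)
grid.fit(df['tweet'], df['label'])
print(grid.best_params_)
```

Because the tf-idf step lives inside the pipeline, it is refit on each training fold, which avoids leaking vocabulary statistics from the validation fold.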

Edit: In response to the follow-up question – "Is it possible to do this without the mlxtend package? Why do I need ColumnSelector here? Is there an sklearn-only solution?"

Yes, as shown below, you will need to build your own column-selector class (this is also how you build any custom transformer and add it to a pipeline).

class SelectColumnsTransformer():
    def __init__(self, columns=None):
        self.columns = columns

    def transform(self, X, **transform_params):
        cpy_df = X[self.columns].copy()
        return cpy_df

    def fit(self, X, y=None, **fit_params):
        return self

# add it to the pipeline
pipe = Pipeline([
    ('selector', SelectColumnsTransformer([<input the column name here>]))
])

For more information on how to do this, refer to this link.
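As another sklearn-only option (an alternative sketch, not the answer's method): scikit-learn ships FunctionTransformer, which can wrap the column lookup so no custom class or mlxtend is needed. The lambda selects the column as a 1-D Series, which is exactly what TfidfVectorizer expects:

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB

df_Xtrain = pd.DataFrame({'tweet': ['This is a tweet']*10,
                          'label': 0})
y_train = df_Xtrain['label'].to_numpy().ravel()

pipe = Pipeline([
    # Select the 'tweet' column as a 1-d Series of raw strings
    ('selector', FunctionTransformer(lambda X: X['tweet'])),
    ('tfidf', TfidfVectorizer()),
    ('bernoulli', BernoulliNB()),
])

pipe.fit(df_Xtrain, y_train)
print(pipe.score(df_Xtrain, y_train))
```

One caveat of the lambda approach: the pipeline cannot be pickled with the standard pickle module; define a named module-level function instead if you need to serialize the model.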
