GridSearchCV.fit() raises TypeError: Expected sequence or array-like, got estimator

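I am trying to do sentiment analysis on Twitter data, following chapter 6 of the book "Building Machine Learning Systems in Python".

The dataset I am using is: https://raw.githubusercontent.com/zfz/twitter_corpus/master/full-corpus.csv

My estimator is a pipeline of a tf-idf vectorizer followed by a naive Bayes classifier.

I then use GridSearchCV() to search for the best parameters for that estimator.

The code is as follows: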

    from load_data import load_data
    from sklearn.cross_validation import ShuffleSplit
    from sklearn.grid_search import GridSearchCV
    from sklearn.metrics import f1_score
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import Pipeline

    def pipeline_tfidf_nb():
        tfidf_vect = TfidfVectorizer( analyzer = "word")
        naive_bayes_clf = MultinomialNB()
        return Pipeline([('vect', tfidf_vect),('nbclf',naive_bayes_clf)])

    input_file = "full-corpus.csv"
    X,y = load_data(input_file)
    print X.shape,y.shape
    clf = pipeline_tfidf_nb()
    cv = ShuffleSplit(n = len(X), test_size = .3, n_iter = 1, random_state = 0)
    clf_param_grid = dict(vect__ngram_range = [(1,1),(1,2),(1,3)],
                          vect__min_df = [1,2],
                          vect__smooth_idf = [False, True],
                          vect__use_idf = [False, True],
                          vect__sublinear_tf = [False, True],
                          vect__binary = [False, True],
                          nbclf__alpha = [0, 0.01, 0.05, 0.1, 0.5, 1],
                          )
    grid_search = GridSearchCV(estimator = clf, param_grid = clf_param_grid, cv = cv, scoring = f1_score)
    grid_search.fit(X, y)
    print grid_search.best_estimator_

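load_data() extracts the rows with positive or negative sentiment from the CSV file.

X is an array of strings (TweetText) and y is an array of booleans (True for positive sentiment).

The error message is as follows: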

    runfile('C:/Users/saurabh.s1/Downloads/Python_ml/ch6/main.py', wdir='C:/Users/saurabh.s1/Downloads/Python_ml/ch6')
    Reloaded modules: load_data
    negative : 572
    positive : 519
    (1091,) (1091,)
    Traceback (most recent call last):
      File "<ipython-input-25-823b07c4ff26>", line 1, in <module>
        runfile('C:/Users/saurabh.s1/Downloads/Python_ml/ch6/main.py', wdir='C:/Users/saurabh.s1/Downloads/Python_ml/ch6')
      File "C:\anaconda2\lib\site-packages\spyder\utils\site\sitecustomize.py", line 866, in runfile
        execfile(filename, namespace)
      File "C:\anaconda2\lib\site-packages\spyder\utils\site\sitecustomize.py", line 87, in execfile
        exec(compile(scripttext, filename, 'exec'), glob, loc)
      File "C:/Users/saurabh.s1/Downloads/Python_ml/ch6/main.py", line 31, in <module>
        grid_search.fit(X, y)
      File "C:\anaconda2\lib\site-packages\sklearn\grid_search.py", line 804, in fit
        return self._fit(X, y, ParameterGrid(self.param_grid))
      File "C:\anaconda2\lib\site-packages\sklearn\grid_search.py", line 553, in _fit
        for parameters in parameter_iterable
      File "C:\anaconda2\lib\site-packages\sklearn\externals\joblib\parallel.py", line 800, in __call__
        while self.dispatch_one_batch(iterator):
      File "C:\anaconda2\lib\site-packages\sklearn\externals\joblib\parallel.py", line 658, in dispatch_one_batch
        self._dispatch(tasks)
      File "C:\anaconda2\lib\site-packages\sklearn\externals\joblib\parallel.py", line 566, in _dispatch
        job = ImmediateComputeBatch(batch)
      File "C:\anaconda2\lib\site-packages\sklearn\externals\joblib\parallel.py", line 180, in __init__
        self.results = batch()
      File "C:\anaconda2\lib\site-packages\sklearn\externals\joblib\parallel.py", line 72, in __call__
        return [func(*args, **kwargs) for func, args, kwargs in self.items]
      File "C:\anaconda2\lib\site-packages\sklearn\cross_validation.py", line 1550, in _fit_and_score
        test_score = _score(estimator, X_test, y_test, scorer)
      File "C:\anaconda2\lib\site-packages\sklearn\cross_validation.py", line 1606, in _score
        score = scorer(estimator, X_test, y_test)
      File "C:\anaconda2\lib\site-packages\sklearn\metrics\classification.py", line 639, in f1_score
        sample_weight=sample_weight)
      File "C:\anaconda2\lib\site-packages\sklearn\metrics\classification.py", line 756, in fbeta_score
        sample_weight=sample_weight)
      File "C:\anaconda2\lib\site-packages\sklearn\metrics\classification.py", line 956, in precision_recall_fscore_support
        y_type, y_true, y_pred = _check_targets(y_true, y_pred)
      File "C:\anaconda2\lib\site-packages\sklearn\metrics\classification.py", line 72, in _check_targets
        check_consistent_length(y_true, y_pred)
      File "C:\anaconda2\lib\site-packages\sklearn\utils\validation.py", line 173, in check_consistent_length
        uniques = np.unique([_num_samples(X) for X in arrays if X is not None])
      File "C:\anaconda2\lib\site-packages\sklearn\utils\validation.py", line 112, in _num_samples
        'estimator %s' % x)
    TypeError: Expected sequence or array-like, got estimator Pipeline(steps=[('vect', TfidfVectorizer(analyzer='word', binary=False, decode_error=u'strict',
            dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
            lowercase=True, max_df=1.0, max_features=None, min_df=1,
            ngram_range=(1, 1), norm=u'l2', preprocessor=None,
            smooth_i...e_idf=False, vocabulary=None)), ('nbclf', MultinomialNB(alpha=0, class_prior=None, fit_prior=True))])

I have already tried reshaping X and y, but that did not help.

Please let me know if you need more data or if I have missed something.

Thanks!


Answer:
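
One likely cause, judging from the traceback: the frame in `_score` calls `scorer(estimator, X_test, y_test)`, and the next frame is `f1_score` itself. A metric function such as `f1_score` has the signature `(y_true, y_pred)`, whereas the `scoring` argument of GridSearchCV expects either a scorer name or a callable with the signature `(estimator, X, y)`. Passing `f1_score` directly therefore puts the pipeline into `y_true`, which would explain "Expected sequence or array-like, got estimator". Below is a minimal sketch of wrapping the metric with `make_scorer`. It rests on assumptions that are not in the question: it uses the newer `sklearn.model_selection` module rather than `sklearn.cross_validation` and `sklearn.grid_search`, Python 3 syntax, a small made-up dataset in place of `load_data()`, and a reduced parameter grid.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV, ShuffleSplit
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for the arrays returned by load_data():
# X holds tweet texts, y holds booleans (True for positive sentiment).
X = np.array(["good great love", "bad awful hate", "love nice good",
              "hate terrible bad", "great nice happy", "awful sad terrible"] * 5)
y = np.array([True, False, True, False, True, False] * 5)

# Same pipeline layout and step names as in the question.
clf = Pipeline([("vect", TfidfVectorizer(analyzer="word")),
                ("nbclf", MultinomialNB())])

# A single shuffled 70/30 split, as in the question (newer API: n_splits).
cv = ShuffleSplit(n_splits=1, test_size=0.3, random_state=0)

# Reduced grid, for illustration only.
param_grid = {"vect__ngram_range": [(1, 1), (1, 2)],
              "nbclf__alpha": [0.01, 0.1, 1]}

# make_scorer turns the metric f1_score(y_true, y_pred) into a scorer
# with the (estimator, X, y) signature that GridSearchCV calls.
grid_search = GridSearchCV(estimator=clf, param_grid=param_grid, cv=cv,
                           scoring=make_scorer(f1_score))
grid_search.fit(X, y)
print(grid_search.best_params_)
```

In the code from the question, the corresponding change would be replacing `scoring = f1_score` with `scoring = make_scorer(f1_score)`, or with the scorer name `scoring = 'f1'`.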
