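I'm trying to use multiple feature columns with a Pipeline inside GridSearch. I pass in two columns that I want to run through TfidfVectorizer, but I run into a problem when GridSearch runs.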
Xs = training_data.loc[:,['text','path_contents']]
y = training_data['class_recoded'].astype('int32')

for col in Xs:
    print Xs[col].shape
print Xs.shape
print y.shape

# (2464L,)
# (2464L,)
# (2464, 2)
# (2464L,)

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([
    ('vectorizer', TfidfVectorizer(encoding="cp1252", stop_words="english")),
    ('nb', MultinomialNB())
])

parameters = {
    'vectorizer__max_df': (0.48, 0.5, 0.52,),
    'vectorizer__max_features': (None, 8500, 9000, 9500),
    'vectorizer__ngram_range': ((1, 3), (1, 4), (1, 5)),
    'vectorizer__use_idf': (False, True)
}

if __name__ == "__main__":
    grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=2)
    grid_search.fit(Xs, y)  # <- error thrown here

    print("Best score: {0}".format(grid_search.best_score_))
    print("Best parameters set:")
    best_parameters = grid_search.best_estimator_.get_params()
    for param_name in sorted(list(parameters.keys())):
        print("\t{0}: {1}".format(param_name, best_parameters[param_name]))
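Error: ValueError: Found input variables with inconsistent numbers of samples: [2, 1642]
I've read about similar errors here and here, and I tried the suggestions from both questions, but neither helped.
I tried selecting my data a different way: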
features = ['text', 'path_contents']
Xs = training_data[features]
I also tried using .values instead, as suggested here, like this:
grid_search.fit(Xs.values, y.values)
But that gave me the following error:
AttributeError: 'numpy.ndarray' object has no attribute 'lower'
What is going on here? I don't know how to proceed.
Answer:
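TfidfVectorizer expects a list of strings as input (more generally, a one-dimensional iterable of strings). That explains "AttributeError: 'numpy.ndarray' object has no attribute 'lower'": you passed in a 2D array, which is iterated as a list of arrays, so each "document" is an array rather than a string.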
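To make the difference concrete, here is a minimal sketch. The toy strings are made up for illustration and stand in for the `text` and `path_contents` columns; only the shape of the input matters.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data, not taken from the question's dataset.
docs = ["the quick brown fox", "jumps over the lazy dog"]
paths = ["home user docs", "var log app"]

# A flat list of strings works: the result has one row per document.
X_ok = TfidfVectorizer().fit_transform(docs)
print(X_ok.shape[0])

# A 2D array is iterated row by row, so each "document" is an ndarray.
# The vectorizer tries to lowercase it and fails.
two_d = np.array(list(zip(docs, paths)), dtype=object)
try:
    TfidfVectorizer().fit_transform(two_d)
    error_message = None
except AttributeError as exc:
    error_message = str(exc)
print(error_message)
```

This mirrors what happens with `Xs.values` in the question: the vectorizer receives rows of a two-column array instead of individual strings.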
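So you have two options: either merge the two columns into a single column beforehand (in pandas), or, if you want to keep the two columns separate, use a feature union in the pipeline (http://scikit-learn.org/stable/modules/pipeline.html#feature-union).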
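Both options could look roughly like the sketch below. It is an illustration under assumptions, not the answerer's code: the toy frame, the helper `column_tfidf`, and the step names `features`, `select`, `text`, and `path` are all hypothetical, and picking a column with `FunctionTransformer` plus `operator.itemgetter` is just one way to feed each branch of the union a single column.

```python
from operator import itemgetter

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer

# Hypothetical toy frame that reuses the question's column names.
training_data = pd.DataFrame({
    "text": ["cheap pills now", "meeting at noon", "win cash now", "lunch at noon"],
    "path_contents": ["promo offer", "office calendar", "promo prize", "office menu"],
    "class_recoded": [1, 0, 1, 0],
})
y = training_data["class_recoded"]

# Option 1: merge both columns into one Series of strings up front.
combined = training_data["text"] + " " + training_data["path_contents"]
single = Pipeline([("vectorizer", TfidfVectorizer()), ("nb", MultinomialNB())])
single.fit(combined, y)


# Option 2: one TfidfVectorizer per column, joined by a FeatureUnion.
def column_tfidf(name):
    """Hypothetical helper: select one column, then vectorize it."""
    return Pipeline([
        ("select", FunctionTransformer(itemgetter(name), validate=False)),
        ("tfidf", TfidfVectorizer()),
    ])


union = Pipeline([
    ("features", FeatureUnion([
        ("text", column_tfidf("text")),
        ("path", column_tfidf("path_contents")),
    ])),
    ("nb", MultinomialNB()),
])
Xs = training_data[["text", "path_contents"]]
union.fit(Xs, y)

# Grid-search keys follow the nested step names joined by double underscores.
parameters = {
    "features__text__tfidf__max_df": (0.5, 1.0),
    "features__path__tfidf__use_idf": (False, True),
}
```

With option 2, each column keeps its own vocabulary and its own hyperparameters, so the grid can tune them independently. A dictionary like `parameters` would be passed to `GridSearchCV` together with `union`.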
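As for the first exception, my guess is that it comes from the interaction between pandas and sklearn. However, because of the error in the code above, you can't pin down the cause with certainty.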
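One plausible reading of the first exception, offered as an assumption rather than something the thread confirms: iterating over a pandas DataFrame yields its column labels, not its rows, so a vectorizer given a two-column frame would see exactly two "documents". The sketch below uses a made-up three-row frame to show the effect.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy frame with the question's column names.
Xs = pd.DataFrame({
    "text": ["first document", "second document", "third document"],
    "path_contents": ["a/b/c", "d/e/f", "g/h/i"],
})

# Iterating a DataFrame walks over column labels.
seen = list(Xs)
print(seen)

# The vectorizer therefore treats the two labels as its documents,
# so the output has 2 rows even though the frame has 3.
n_rows = TfidfVectorizer().fit_transform(Xs).shape[0]
print(n_rows, len(Xs))
```

If that is what happened in the question, the 2 in `[2, 1642]` would be the number of columns, and 1642 would be the number of labels in one cross-validation training fold.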