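I am working on a text classification project.
While exploring different classifiers, I came across XGBClassifier.
My task is multi-class classification. When I try to score the classifier, I get the error shown below. I suspect some reshaping is needed, but I can't work out why. This seems odd to me, because the other classifiers work fine (and so does this one when it uses its default parameters).
Here is the relevant part of my code: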
    from sklearn import linear_model, svm
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.pipeline import Pipeline
    from xgboost import XGBClassifier

    algorithms = [
        svm.LinearSVC(),                 # <<<=== works
        linear_model.RidgeClassifier(),  # <<<=== works
        XGBClassifier(),                 # <<<=== works
        XGBClassifier(objective='multi:softprob',
                      num_class=len(groups_count_dict),
                      eval_metric='merror'),  # <<<=== fails
    ]

    def train(algorithm, X_train, y_train):
        model = Pipeline([
            ('vect', transformer),
            ('classifier', OneVsRestClassifier(algorithm))
        ])
        model.fit(X_train, y_train)
        return model

    score_dict = {}
    algorithm_to_model_dict = {}
    for algorithm in algorithms:
        print()
        print(f'trying {algorithm}')
        model = train(algorithm, X_train, y_train)
        score = model.score(X_test, y_test)
        score_dict[algorithm] = int(score * 100)
        algorithm_to_model_dict[algorithm] = model

    sorted_score_dict = {k: v for k, v in sorted(score_dict.items(), key=lambda item: item[1])}
    for classifier, score in sorted_score_dict.items():
        print(f'{classifier.__class__.__name__}: score is {score}%')
The error:
ValueError: operands could not be broadcast together with shapes (2557,) (8,) (2557,)
I'm not sure whether it's relevant, but I'll mention it anyway: my transformer is created as follows:
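    from sklearn.compose import ColumnTransformer
    from sklearn.feature_extraction.text import TfidfVectorizer

    tuples = []
    tfidf_kwargs = {'ngram_range': (1, 2), 'stop_words': 'english', 'sublinear_tf': True}
    for col in list(features.columns):
        # one TF-IDF vectorizer per text column
        tuples.append((f'vec_{col}', TfidfVectorizer(**tfidf_kwargs), col))
    transformer = ColumnTransformer(tuples, remainder='passthrough')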
Thanks in advance.

Edit:
Adding the full traceback:
    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    <ipython-input-15-576cd62f3df0> in <module>
         84     print(f'trying {algorithm}')
         85     model = train(algorithm, X_train, y_train)
    ---> 86     score = model.score(X_test, y_test)
         87     score_dict[algorithm] = int(score * 100)
         88     algorithm_to_model_dict[algorithm] = model

    /Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/sklearn/utils/metaestimators.py in <lambda>(*args, **kwargs)
        118
        119         # lambda, but not partial, allows help() to work with update_wrapper
    --> 120         out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)
        121         # update the docstring of the returned function
        122         update_wrapper(out, self.fn)

    /Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/sklearn/pipeline.py in score(self, X, y, sample_weight)
        620         if sample_weight is not None:
        621             score_params['sample_weight'] = sample_weight
    --> 622         return self.steps[-1][-1].score(Xt, y, **score_params)
        623
        624     @property

    /Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/sklearn/base.py in score(self, X, y, sample_weight)
        498         """
        499         from .metrics import accuracy_score
    --> 500         return accuracy_score(y, self.predict(X), sample_weight=sample_weight)
        501
        502     def _more_tags(self):

    /Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/sklearn/multiclass.py in predict(self, X)
        365             for i, e in enumerate(self.estimators_):
        366                 pred = _predict_binary(e, X)
    --> 367                 np.maximum(maxima, pred, out=maxima)
        368                 argmaxima[maxima == pred] = i
        369             return self.classes_[argmaxima]

    ValueError: operands could not be broadcast together with shapes (2557,) (8,) (2557,)
Printing the shapes of X_test and y_test gives: (2557, 12) (2557,)
I can see where the (8,) comes from: it is the length of groups_count_dict.
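To make the shapes in the message concrete, here is a minimal NumPy-only sketch of the failing call from the traceback, `np.maximum(maxima, pred, out=maxima)`. The array contents are made up for illustration; only the shapes (2557 test samples, 8 classes) come from the question. It assumes that `pred` arrives with one value per class instead of one value per sample, which is what the message suggests.

```python
import numpy as np

n_samples, n_classes = 2557, 8  # shapes taken from the error message

# OneVsRestClassifier.predict keeps a running per-sample maximum score.
maxima = np.full(n_samples, -np.inf)

# Hypothetical stand-in for a wrongly shaped binary prediction:
# one value per class rather than one value per sample.
bad_pred = np.zeros(n_classes)

try:
    np.maximum(maxima, bad_pred, out=maxima)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)

# With one score per sample, the same call broadcasts without error.
good_pred = np.zeros(n_samples)
np.maximum(maxima, good_pred, out=maxima)
```

The three shapes in the message are the two inputs and the `out` array, so the middle one, (8,), is the operand that does not fit.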
Answer:
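It turned out that the solution was to remove OneVsRestClassifier from the pipeline. A likely explanation is that OneVsRestClassifier fits one binary classifier per class, while an XGBClassifier configured with objective='multi:softprob' and num_class set to the number of classes is set up to produce multi-class output, so the wrapper receives predictions in a shape it does not expect. XGBClassifier handles multi-class targets on its own, so the wrapper is not needed.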
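A minimal sketch of the change, assuming the rest of the setup stays as in the question: the classifier goes into the Pipeline directly, with no OneVsRestClassifier around it. To keep the example self-contained, it uses stand-ins that are not from the question: random toy data, StandardScaler in place of the TF-IDF ColumnTransformer, and LogisticRegression in place of XGBClassifier. The `transformer` argument of `train` is also added here only so the function does not rely on a global.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def train(algorithm, transformer, X_train, y_train):
    """Fit transformer + classifier, without a OneVsRestClassifier wrapper."""
    model = Pipeline([
        ('vect', transformer),
        ('classifier', algorithm),  # the estimator itself, not OneVsRestClassifier(algorithm)
    ])
    model.fit(X_train, y_train)
    return model


# Toy 3-class data (hypothetical; stands in for the question's text features).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = rng.integers(0, 3, size=300)

# Stand-in estimator and transformer, see the note above.
model = train(LogisticRegression(max_iter=1000), StandardScaler(), X, y)
pred = model.predict(X)
score = model.score(X, y)
print(pred.shape, score)
```

In the question's code, the same `train` would receive the configured XGBClassifier and the ColumnTransformer instead of the stand-ins.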