Unable to stack multi-label classifiers

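I am working on a multi-label text classification problem (90 target labels in total). The data is long-tailed and class-imbalanced, with about 100,000 records. I am using the OAA (one-vs-all) strategy, and I am trying to build an ensemble model by stacking.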

Text features: HashingVectorizer (2**20 features, character analyzer).
Dimensionality reduction: TruncatedSVD (n_components=200).

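    from sklearn.decomposition import TruncatedSVD
    from sklearn.ensemble import ExtraTreesClassifier, StackingClassifier
    from sklearn.feature_extraction.text import HashingVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.pipeline import FeatureUnion, Pipeline
    from sklearn.svm import LinearSVC

    # Character-level hashed features, reduced to 200 dimensions
    text_pipeline = Pipeline([
        ('hashing_vectorizer', HashingVectorizer(n_features=2**20,
                                                 analyzer='char')),
        ('svd', TruncatedSVD(algorithm='randomized',
                             n_components=200, random_state=19204))])

    feat_pipeline = FeatureUnion([('text', text_pipeline)])

    # Each base model is wrapped in OneVsRestClassifier individually
    estimators_list = [
        ('ExtraTrees',
         OneVsRestClassifier(ExtraTreesClassifier(n_estimators=30,
                                                  class_weight="balanced",
                                                  random_state=4621))),
        ('linearSVC',
         OneVsRestClassifier(LinearSVC(class_weight='balanced')))]

    estimators_ensemble = StackingClassifier(
        estimators=estimators_list,
        final_estimator=OneVsRestClassifier(
            LogisticRegression(solver='lbfgs', max_iter=300)))

    classifier_pipeline = Pipeline([
        ('features', feat_pipeline),
        ('clf', estimators_ensemble)])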

Error

    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    <ipython-input-41-ad4e769a0a78> in <module>()
          1 start = time.time()
    ----> 2 classifier_pipeline.fit(X_train.values, y_train_encoded)
          3 print(f"Execution time {time.time()-start}")
          4 

    3 frames
    /usr/local/lib/python3.6/dist-packages/sklearn/utils/validation.py in column_or_1d(y, warn)
        795         return np.ravel(y)
        796 
    --> 797     raise ValueError("bad input shape {0}".format(shape))
        798 
        799 

    ValueError: bad input shape (89792, 83)

Answer:

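StackingClassifier does not currently support multi-label classification. You can confirm this from the shape documented for the arguments of its fit method: y is expected to be one-dimensional, which is why the two-dimensional label matrix of shape (89792, 83) is rejected.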
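The shape check behind the error can be reproduced in isolation. The following is a minimal sketch, assuming a small hand-made label matrix (y_multilabel is a made-up example, not the data from the question); it uses scikit-learn's column_or_1d and type_of_target helpers to show that a 2D indicator matrix is rejected where a single label column is accepted. The exact wording of the exception varies between scikit-learn versions.

```python
import numpy as np
from sklearn.utils import column_or_1d
from sklearn.utils.multiclass import type_of_target

# Hypothetical multi-label target: 4 samples, 3 labels (indicator matrix)
y_multilabel = np.array([[1, 0, 1],
                         [0, 1, 0],
                         [1, 1, 0],
                         [0, 0, 1]])

target_kind = type_of_target(y_multilabel)
print(target_kind)  # multilabel-indicator

# column_or_1d is the validation helper named in the traceback;
# it only accepts 1D targets (or a single column), so this raises.
try:
    column_or_1d(y_multilabel)
    raised = False
except ValueError as exc:
    raised = True
    print("rejected:", exc)

# A single label column is an ordinary binary target and passes the check
y_single = y_multilabel[:, 0]
single_kind = type_of_target(y_single)
flat = column_or_1d(y_single)
print(single_kind, flat.shape)
```

This is why fitting one stacking model per label column, as described next, avoids the problem: each inner model only ever sees a 1D binary target.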

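The solution is to apply the OneVsRestClassifier wrapper to the StackingClassifier itself rather than to the individual models. That way one stacking ensemble is fitted per label, and each one receives a one-dimensional binary target.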

Example:

    from sklearn.datasets import make_multilabel_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import ExtraTreesClassifier
    from sklearn.svm import LinearSVC
    from sklearn.ensemble import StackingClassifier
    from sklearn.multiclass import OneVsRestClassifier

    X, y = make_multilabel_classification(n_classes=3, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                        test_size=0.33,
                                                        random_state=42)

    # Base models are plain binary classifiers, no OneVsRest wrapper here
    estimators_list = [
        ('ExtraTrees', ExtraTreesClassifier(n_estimators=30,
                                            class_weight="balanced",
                                            random_state=4621)),
        ('linearSVC', LinearSVC(class_weight='balanced'))]

    estimators_ensemble = StackingClassifier(
        estimators=estimators_list,
        final_estimator=LogisticRegression(solver='lbfgs', max_iter=300))

    # Wrap the whole stacking ensemble: one ensemble is fitted per label
    ovr_model = OneVsRestClassifier(estimators_ensemble)
    ovr_model.fit(X_train, y_train)
    ovr_model.score(X_test, y_test)
    # 0.45454545454545453

    # Inspect the first base model (ExtraTrees) of the ensemble for label 0
    from sklearn.metrics import confusion_matrix
    confusion_matrix(
        y_train[:, 0],
        ovr_model.estimators_[0].estimators_[0].predict(X_train),
    )
    #array([[818,   0],
    #       [  0, 522]])

    ovr_model.estimators_[0].estimators_[0].feature_importances_
    #array([0.05049793, 0.07232525, 0.05278524, 0.08005984, 0.05036507,
    #       0.03674032, 0.06144285, 0.03473714, 0.04080104, 0.05120309,
    #       0.05311589, 0.04119592, 0.03239608, 0.08101098, 0.03522335,
    #       0.03676684, 0.04613645, 0.04755277, 0.05268342, 0.04296053])
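The same rearrangement can be carried over to the text pipeline from the question. What follows is only a sketch under stated assumptions: the toy corpus, the tag names, and the reduced sizes (n_features=2**10, n_components=5) are made up so the example runs quickly, and are not the asker's data or settings. The structural point is that OneVsRestClassifier wraps the StackingClassifier inside the pipeline, while the base estimators stay unwrapped.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import ExtraTreesClassifier, StackingClassifier
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Hypothetical toy corpus: 4 templates repeated 10 times with a varying suffix
base = [
    ("pandas dataframe groupby in python", ["python"]),
    ("spring boot rest controller in java", ["java"]),
    ("calling java code from python with jpype", ["python", "java"]),
    ("css grid layout for a web page", []),
]
texts = [f"{text} example {i}" for i in range(10) for text, _ in base]
tags = [labels for _ in range(10) for _, labels in base]

# Encode the tag lists as a 2D indicator matrix (n_samples, n_labels)
y = MultiLabelBinarizer().fit_transform(tags)

# Same feature steps as in the question, with smaller (assumed) sizes
text_pipeline = Pipeline([
    ('hashing_vectorizer', HashingVectorizer(n_features=2**10,
                                             analyzer='char')),
    ('svd', TruncatedSVD(algorithm='randomized',
                         n_components=5, random_state=19204))])

# Base estimators are NOT wrapped in OneVsRestClassifier
stack = StackingClassifier(
    estimators=[
        ('ExtraTrees', ExtraTreesClassifier(n_estimators=30,
                                            class_weight='balanced',
                                            random_state=4621)),
        ('linearSVC', LinearSVC(class_weight='balanced'))],
    final_estimator=LogisticRegression(solver='lbfgs', max_iter=300))

# The wrapper goes around the whole stacking ensemble
classifier_pipeline = Pipeline([
    ('features', text_pipeline),
    ('clf', OneVsRestClassifier(stack))])

classifier_pipeline.fit(texts, y)
pred = classifier_pipeline.predict(texts)
print(pred.shape)  # one column of predictions per label
```

One trade-off worth keeping in mind: with this layout a full stacking ensemble, including its internal cross-validation, is trained for every label, so with 90 labels the training cost grows accordingly.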
