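I am working on a multilabel text classification problem with 90 target labels in total. The data is long-tailed and class-imbalanced, with roughly 100,000 records. I am using the OAA (one-against-all) strategy, and I am trying to build an ensemble model with stacking.
Text features: HashingVectorizer (n_features=2**20, character analyzer), followed by TruncatedSVD to reduce dimensionality (n_components=200).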
text_pipeline = Pipeline([
    ('hashing_vectorizer', HashingVectorizer(n_features=2**20, analyzer='char')),
    ('svd', TruncatedSVD(algorithm='randomized', n_components=200, random_state=19204)),
])

feat_pipeline = FeatureUnion([('text', text_pipeline)])

estimators_list = [
    ('ExtraTrees', OneVsRestClassifier(ExtraTreesClassifier(n_estimators=30, class_weight="balanced", random_state=4621))),
    ('linearSVC', OneVsRestClassifier(LinearSVC(class_weight='balanced'))),
]

estimators_ensemble = StackingClassifier(
    estimators=estimators_list,
    final_estimator=OneVsRestClassifier(LogisticRegression(solver='lbfgs', max_iter=300)),
)

classifier_pipeline = Pipeline([
    ('features', feat_pipeline),
    ('clf', estimators_ensemble),
])
Error:
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-41-ad4e769a0a78> in <module>()
      1 start = time.time()
----> 2 classifier_pipeline.fit(X_train.values, y_train_encoded)
      3 print(f"Execution time {time.time()-start}")
      4 

3 frames
/usr/local/lib/python3.6/dist-packages/sklearn/utils/validation.py in column_or_1d(y, warn)
    795         return np.ravel(y)
    796 
--> 797     raise ValueError("bad input shape {0}".format(shape))
    798 
    799 

ValueError: bad input shape (89792, 83)
Answer:
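At the time of writing, StackingClassifier does not support multilabel classification. You can see this from the shape that its fit method expects for y: a 1-D array of shape (n_samples,). This is consistent with the traceback, where column_or_1d rejects a target of shape (89792, 83).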
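To make the shape restriction concrete, here is a minimal sketch of the kind of 1-D check that appears in the traceback. The toy target `y_multilabel` is hypothetical, and the exact error message depends on the scikit-learn version, so it is printed rather than assumed.

```python
import numpy as np
from sklearn.utils import column_or_1d
from sklearn.utils.multiclass import type_of_target

# Hypothetical toy target: 6 samples, 3 labels (a multilabel indicator matrix).
y_multilabel = np.array([
    [1, 0, 1],
    [0, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
    [1, 0, 0],
    [0, 1, 1],
])

# scikit-learn recognises a 2-D binary matrix as a multilabel target.
target_type = type_of_target(y_multilabel)
print(target_type)  # multilabel-indicator

# The same 1-D validation helper named in the traceback rejects it.
try:
    column_or_1d(y_multilabel)
    rejected = False
except ValueError as exc:
    rejected = True
    print(f"ValueError: {exc}")

# A single label column is 1-D, so it passes the check.
first_label = column_or_1d(y_multilabel[:, 0])
print(first_label.shape)  # (6,)
```

Each column of the label matrix is a valid binary target on its own, which is what makes a one-vs-rest wrapper around the whole stack workable.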
The solution is to wrap the StackingClassifier itself in OneVsRestClassifier, rather than wrapping each individual model.
Example:
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.svm import LinearSVC
from sklearn.ensemble import StackingClassifier
from sklearn.multiclass import OneVsRestClassifier

X, y = make_multilabel_classification(n_classes=3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# Base models are plain binary classifiers, with no OneVsRestClassifier around them
estimators_list = [
    ('ExtraTrees', ExtraTreesClassifier(n_estimators=30, class_weight="balanced", random_state=4621)),
    ('linearSVC', LinearSVC(class_weight='balanced')),
]

estimators_ensemble = StackingClassifier(
    estimators=estimators_list,
    final_estimator=LogisticRegression(solver='lbfgs', max_iter=300),
)

# One stacked ensemble is fitted per label
ovr_model = OneVsRestClassifier(estimators_ensemble)
ovr_model.fit(X_train, y_train)
ovr_model.score(X_test, y_test)
# Output reported in the original answer:
# 0.45454545454545453

from sklearn.metrics import confusion_matrix

# ovr_model.estimators_[0] is the stack fitted for the first label;
# its estimators_[0] is the fitted ExtraTreesClassifier
confusion_matrix(
    y_train[:, 0],
    ovr_model.estimators_[0].estimators_[0].predict(X_train),
)
# Output reported in the original answer:
#array([[818,   0],
#       [  0, 522]])

ovr_model.estimators_[0].estimators_[0].feature_importances_
# Output reported in the original answer:
#array([0.05049793, 0.07232525, 0.05278524, 0.08005984, 0.05036507,
#       0.03674032, 0.06144285, 0.03473714, 0.04080104, 0.05120309,
#       0.05311589, 0.04119592, 0.03239608, 0.08101098, 0.03522335,
#       0.03676684, 0.04613645, 0.04755277, 0.05268342, 0.04296053])
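The same wrapping order can be carried over to the text pipeline from the question. The following is only a sketch under stated assumptions: the toy corpus and its three labels are hypothetical, `n_components` is reduced from 200 to 5 so that it suits such a small corpus, and the single-member FeatureUnion is left out for brevity.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import ExtraTreesClassifier, StackingClassifier
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus: every on/off combination of three topic phrases,
# repeated ten times, so each label has 40 positives and 40 negatives.
topics = ["football match", "election vote", "python code"]
texts, labels = [], []
for i in range(80):
    flags = [(i >> bit) & 1 for bit in range(3)]
    words = [t for t, f in zip(topics, flags) if f] or ["misc"]
    texts.append(" ".join(words) + f" item {i}")
    labels.append(flags)
y = np.array(labels)  # shape (80, 3), multilabel indicator matrix

text_pipeline = Pipeline([
    ("hashing_vectorizer", HashingVectorizer(n_features=2**20, analyzer="char")),
    # Assumption: 5 components instead of 200, only because the corpus is tiny.
    ("svd", TruncatedSVD(algorithm="randomized", n_components=5, random_state=19204)),
])

# Base and final estimators are plain binary classifiers.
stack = StackingClassifier(
    estimators=[
        ("ExtraTrees", ExtraTreesClassifier(n_estimators=30, class_weight="balanced", random_state=4621)),
        ("linearSVC", LinearSVC(class_weight="balanced")),
    ],
    final_estimator=LogisticRegression(solver="lbfgs", max_iter=300),
)

# OneVsRestClassifier wraps the whole stack, so each label gets its own ensemble.
classifier_pipeline = Pipeline([
    ("features", text_pipeline),
    ("clf", OneVsRestClassifier(stack)),
])

classifier_pipeline.fit(texts, y)
pred = classifier_pipeline.predict(texts)
print(pred.shape)  # (80, 3)
```

With 90 labels this fits 90 separate stacked ensembles, each with its own internal cross-validation, so training time grows accordingly.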