Here is a simple example of multilabel classification (taken from a question about classifying into multiple categories with scikit-learn):
    import numpy as np
    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.svm import LinearSVC
    from sklearn.feature_extraction.text import TfidfTransformer
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn import preprocessing
    from sklearn.metrics import accuracy_score

    X_train = np.array(["new york is a hell of a town",
                        "new york was originally dutch",
                        "the big apple is great",
                        "new york is also called the big apple",
                        "nyc is nice",
                        "people abbreviate new york city as nyc",
                        "the capital of great britain is london",
                        "london is in the uk",
                        "london is in england",
                        "london is in great britain",
                        "it rains a lot in london",
                        "london hosts the british museum",
                        "new york is great and so is london",
                        "i like london better than new york"])
    y_train_text = [["new york"],["new york"],["new york"],["new york"],
                    ["new york"],["new york"],["london"],["london"],
                    ["london"],["london"],["london"],["london"],
                    ["new york","london"],["new york","london"]]

    X_test = np.array(['nice day in nyc',
                       'welcome to london',
                       'london is rainy',
                       'it is raining in britian',
                       'it is raining in britian and the big apple',
                       'it is raining in britian and nyc',
                       'hello welcome to new york. enjoy it here and london too'])
    y_test_text = [["new york"],["london"],["london"],["london"],
                   ["new york", "london"],["new york", "london"],
                   ["new york", "london"]]

    lb = preprocessing.MultiLabelBinarizer()
    Y = lb.fit_transform(y_train_text)
    Y_test = lb.fit_transform(y_test_text)

    classifier = Pipeline([
        ('vectorizer', CountVectorizer()),
        ('tfidf', TfidfTransformer()),
        ('clf', OneVsRestClassifier(LinearSVC()))])

    classifier.fit(X_train, Y)
    predicted = classifier.predict(X_test)

    print "Accuracy Score: ",accuracy_score(Y_test, predicted)
The code runs fine and prints the accuracy score, but if I change y_test_text to

    y_test_text = [["new york"],["london"],["england"],["london"],["new york", "london"],["new york", "london"],["new york", "london"]]

I get
    Traceback (most recent call last):
      File "/Users/scottstewart/Documents/scikittest/example.py", line 52, in <module>
        print "Accuracy Score: ",accuracy_score(Y_test, predicted)
      File "/Library/Python/2.7/site-packages/sklearn/metrics/classification.py", line 181, in accuracy_score
        differing_labels = count_nonzero(y_true - y_pred, axis=1)
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/scipy/sparse/compressed.py", line 393, in __sub__
        raise ValueError("inconsistent shapes")
    ValueError: inconsistent shapes
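Note that this introduces the label "england", which does not appear in the training set. How can I use multilabel classification so that some metrics can still be run when a label shows up only in the test set? Or is that even possible?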
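Edit: Thanks for the answers, everyone. I think my question is really about how scikit's binarizer works, or should work. Given my short example code, I would also expect that if I changed y_test_text to

    y_test_text = [["new york"],["new york"],["new york"],["new york"],["new york"],["new york"],["new york"]]

it would work, since we have already fitted for that label. In that case, however, I get

    ValueError: Can't handle mix of binary and multilabel-indicator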
Answer:
You can, if you also "introduce" the new label in the training set y, like this:
    import numpy as np
    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.svm import LinearSVC
    from sklearn.feature_extraction.text import TfidfTransformer
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn import preprocessing
    from sklearn.metrics import accuracy_score

    X_train = np.array(["new york is a hell of a town",
                        "new york was originally dutch",
                        "the big apple is great",
                        "new york is also called the big apple",
                        "nyc is nice",
                        "people abbreviate new york city as nyc",
                        "the capital of great britain is london",
                        "london is in the uk",
                        "london is in england",
                        "london is in great britain",
                        "it rains a lot in london",
                        "london hosts the british museum",
                        "new york is great and so is london",
                        "i like london better than new york"])
    # "England" is added to one training sample so the classifier sees three labels.
    y_train_text = [["new york"],["new york"],["new york"],["new york"],
                    ["new york"],["new york"],["london"],["london"],
                    ["london"],["london"],["london"],["london"],
                    ["new york","England"],["new york","london"]]

    X_test = np.array(['nice day in nyc',
                       'welcome to london',
                       'london is rainy',
                       'it is raining in britian',
                       'it is raining in britian and the big apple',
                       'it is raining in britian and nyc',
                       'hello welcome to new york. enjoy it here and london too'])
    y_test_text = [["new york"],["new york"],["new york"],["new york"],
                   ["new york"],["new york"],["new york"]]

    # Fix the label set up front so train and test are encoded with the same columns.
    lb = preprocessing.MultiLabelBinarizer(classes=("new york","london","England"))
    Y = lb.fit_transform(y_train_text)
    # Reuse the fitted binarizer on the test labels instead of fitting it again.
    Y_test = lb.transform(y_test_text)
    print Y_test

    classifier = Pipeline([
        ('vectorizer', CountVectorizer()),
        ('tfidf', TfidfTransformer()),
        ('clf', OneVsRestClassifier(LinearSVC()))])

    classifier.fit(X_train, Y)
    predicted = classifier.predict(X_test)
    print predicted

    print "Accuracy Score: ",accuracy_score(Y_test, predicted)
Output:

    Accuracy Score: 0.571428571429
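For multilabel indicator input, accuracy_score reports subset accuracy: a sample counts as correct only when every label in its row matches. A score of 0.571428571429 equals 4/7, which is consistent with four of the seven test rows matching exactly. The sketch below illustrates the idea with a hypothetical pure-Python helper, `subset_accuracy` (not a scikit-learn function), and made-up rows rather than the real predictions.

```python
def subset_accuracy(y_true, y_pred):
    """Fraction of samples whose full label row matches exactly."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same number of rows")
    # A row is a match only if every column agrees.
    matches = sum(1 for t, p in zip(y_true, y_pred) if list(t) == list(p))
    return matches / float(len(y_true))


# Made-up indicator rows (columns: new york, london, England).
y_true = [[1, 0, 0], [1, 0, 0], [1, 0, 0], [1, 0, 0]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 1, 0], [1, 0, 0]]

print(subset_accuracy(y_true, y_pred))  # 0.5: rows 1 and 4 match, rows 2 and 3 do not
```

Because a single wrong label makes the whole row count as wrong, this metric is strict; a row that predicts "new york" correctly but also adds "london" scores zero.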
The key part is:

    y_train_text = [["new york"],["new york"],["new york"],
                    ["new york"],["new york"],["new york"],
                    ["london"],["london"],["london"],["london"],
                    ["london"],["london"],["new york","England"],
                    ["new york","london"]]

We insert "England" here as well. This makes sense: if the classifier has never seen a label, how could it predict that label? This way we create a three-label classification problem.
Edit:

    lb = preprocessing.MultiLabelBinarizer(classes=("new york","london","England"))

You have to pass the classes as a parameter to MultiLabelBinarizer(), so that it encodes any y_test_text with the same set of columns.
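Both errors in the question appear to come from fitting the binarizer a second time on the test labels, which rebuilds the columns from whatever labels the test set happens to contain. The following is a minimal sketch of that effect, assuming a reasonably recent scikit-learn; the shortened label lists are illustrative, not the full data above.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Shortened, illustrative label sets in the style of the example above.
y_train_text = [["new york"], ["london"], ["new york", "london"]]
y_test_text = [["new york"], ["new york"], ["new york"]]

# Refitting on the test labels builds columns from the test labels only.
refit = MultiLabelBinarizer().fit_transform(y_test_text)
print(refit.shape)  # (3, 1): a single column, which no longer looks multilabel

# Fixing the classes and reusing one fitted binarizer keeps the columns aligned.
lb = MultiLabelBinarizer(classes=["new york", "london", "England"])
Y_train = lb.fit_transform(y_train_text)
Y_test = lb.transform(y_test_text)
print(Y_train.shape, Y_test.shape)  # (3, 3) (3, 3)
```

With a single column, the test matrix no longer has the same width as the predictions, so the metric cannot compare them; with a fixed class list, both matrices always have one column per listed class.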