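I am trying to classify text with multiple labels, and it works well. However, because I want to consider predicted labels that fall below the 0.5 threshold, I changed predict() to predict_proba() to get the probabilities for all labels and select values according to different thresholds. The problem is that I cannot convert each label's probability values back into the actual text labels. Here is reproducible code: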

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

X_train = np.array(["new york is a hell of a town",
                    "new york was originally dutch",
                    "the big apple is great",
                    "new york is also called the big apple",
                    "nyc is nice",
                    "people abbreviate new york city as nyc",
                    "the capital of great britain is london",
                    "london is in the uk",
                    "london is in england",
                    "london is in great britain",
                    "it rains a lot in london",
                    "london hosts the british museum",
                    "new york is great and so is london",
                    "i like london better than new york"])
y_train_text = [["new york"], ["new york"], ["new york"], ["new york"], ["new york"],
                ["new york"], ["london"], ["london"], ["london"], ["london"],
                ["london"], ["london"], ["new york", "london"], ["new york", "london"]]
X_test = np.array(['nice day in nyc',
                   'welcome to london',
                   'london is rainy',
                   'it is raining in britian',
                   'it is raining in britian and the big apple',
                   'it is raining in britian and nyc',
                   'hello welcome to new york. enjoy it here and london too'])
target_names = ['New York', 'London']

# Binarize the label lists into a 0/1 indicator matrix (one column per label)
lb = MultiLabelBinarizer()
Y = lb.fit_transform(y_train_text)

classifier = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', OneVsRestClassifier(LinearSVC()))])
classifier.fit(X_train, Y)

# Note: LinearSVC itself has no predict_proba method; this call needs a base
# estimator that provides one (e.g. LogisticRegression).
predicted = classifier.predict_proba(X_test)

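This gives me the label probabilities for each X_test value. Now, when I try to use lb.inverse_transform(predicted[0]) to get the actual labels for the first X_test item, it does not work.
Can anyone help? What am I doing wrong, and how can I get the desired result?
Note: The above is dummy data, but I have 500 labels, and each text can have at most 5 labels.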
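Answer:
I got the actual labels by taking the indices of the multilabel classes and the indices of the predicted probabilities and matching them, because sklearn has no direct method for this.
Here is how I did it.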
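
from sklearn.preprocessing import MultiLabelBinarizer

multilabel = MultiLabelBinarizer()
# target_labels is a placeholder for the training labels
# (a list of label lists, like y_train_text in the question)
y = multilabel.fit_transform(target_labels)
predicted_list = classifier.predict_proba(X_test)

def get_labels(predicted_list):
    # (index, class name) pairs from the fitted binarizer
    mlb = [(i1, c1) for i1, c1 in enumerate(multilabel.classes_)]
    # (index, probability) pairs, highest probability first
    temp_list = sorted([(i, c) for i, c in enumerate(list(predicted_list))],
                       key=lambda x: x[1], reverse=True)
    # keep entries at or above the threshold; 0.35 is the threshold I chose
    tag_list = [item1 for item1 in temp_list if item1[1] >= 0.35]
    # take at most the top 5 labels and map their indices back to class names
    tags = [item[1] for item2 in tag_list[:5] for item in mlb if item2[0] == item[0]]
    return tags

get_labels(predicted_list[0])
>> ['New York']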
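
As a possible alternative to the index matching above, here is a minimal sketch that thresholds the probabilities into a 0/1 indicator matrix and then hands that matrix to inverse_transform, which expects a 2-D array of zeros and ones rather than a single row of probabilities. The helper name labels_from_proba, its parameters, and the demo probabilities are made up for illustration and are not part of sklearn or the original post.

```python
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer


def labels_from_proba(proba, mlb, threshold=0.35, max_labels=5):
    """Convert an (n_samples, n_classes) probability matrix to label tuples.

    Hypothetical helper: threshold and max_labels mirror the values used
    in the answer above (0.35 and 5).
    """
    proba = np.atleast_2d(np.asarray(proba, dtype=float))
    # Mark the max_labels highest-scoring classes in each row.
    k = min(max_labels, proba.shape[1])
    top_idx = np.argsort(-proba, axis=1)[:, :k]
    top_mask = np.zeros(proba.shape, dtype=bool)
    np.put_along_axis(top_mask, top_idx, True, axis=1)
    # Keep a label only if it is in the top k AND clears the threshold.
    indicator = (top_mask & (proba >= threshold)).astype(int)
    # inverse_transform maps each 0/1 row back to a tuple of label names.
    return mlb.inverse_transform(indicator)


mlb = MultiLabelBinarizer()
mlb.fit([["new york"], ["london"], ["new york", "london"]])

# Made-up probabilities; columns follow mlb.classes_ (sorted label names).
demo_proba = np.array([[0.10, 0.80],
                       [0.60, 0.40],
                       [0.20, 0.30]])
result = labels_from_proba(demo_proba, mlb)
print(result)
```

Note that inverse_transform returns the labels of each row in the order of mlb.classes_, not sorted by probability as get_labels does, and a row where nothing clears the threshold comes back as an empty tuple.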