Converting multilabel binarized probabilities into target labels

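I am classifying text into multiple labels, and it works well. However, I also want to consider labels predicted with a probability below the 0.5 threshold, so I changed predict() to predict_proba() to get the probability of every label and select labels at different thresholds. The problem is that I cannot convert the binarized per-label probability values back into the actual text labels. Here is reproducible code: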

import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

X_train = np.array(["new york is a hell of a town",
                    "new york was originally dutch",
                    "the big apple is great",
                    "new york is also called the big apple",
                    "nyc is nice",
                    "people abbreviate new york city as nyc",
                    "the capital of great britain is london",
                    "london is in the uk",
                    "london is in england",
                    "london is in great britain",
                    "it rains a lot in london",
                    "london hosts the british museum",
                    "new york is great and so is london",
                    "i like london better than new york"])
y_train_text = [["new york"], ["new york"], ["new york"], ["new york"], ["new york"],
                ["new york"], ["london"], ["london"], ["london"], ["london"],
                ["london"], ["london"], ["new york", "london"], ["new york", "london"]]

X_test = np.array(['nice day in nyc',
                   'welcome to london',
                   'london is rainy',
                   'it is raining in britian',
                   'it is raining in britian and the big apple',
                   'it is raining in britian and nyc',
                   'hello welcome to new york. enjoy it here and london too'])
target_names = ['New York', 'London']

# Binarize the label lists into a 0/1 indicator matrix (one column per label).
lb = MultiLabelBinarizer()
Y = lb.fit_transform(y_train_text)

# LinearSVC does not implement predict_proba, so it is wrapped in
# CalibratedClassifierCV to obtain probability estimates.
classifier = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', OneVsRestClassifier(CalibratedClassifierCV(LinearSVC(), cv=3)))])
classifier.fit(X_train, Y)
predicted = classifier.predict_proba(X_test)

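This gives me the probability of each label for every value in X_test. Now, when I try lb.inverse_transform(predicted[0]) to get the actual labels for the first X_test item, it does not work.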

Can anyone help? What am I doing wrong, and how can I get the desired result?

Note: the above is dummy data. My real data has 500 labels, and any given text can have at most 5 labels.


Answer:

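Since I could not find a direct method for this in sklearn, I got the actual labels by taking the indices of the predicted multilabel class probabilities and matching them against the binarizer's classes.

Here is how I did it.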

multilabel = MultiLabelBinarizer()
# target_labels: the list of label lists used for training
y = multilabel.fit_transform(target_labels)
predicted_list = classifier.predict_proba(X_test)

def get_labels(predicted_list):
    # (index, probability) pairs, sorted by probability in descending order
    ranked = sorted(enumerate(predicted_list), key=lambda x: x[1], reverse=True)
    # keep only labels at or above the threshold; 0.35 is the threshold I chose
    tag_list = [item for item in ranked if item[1] >= 0.35]
    # map indices back to label names; I keep only the top 5 labels if there are more
    tags = [multilabel.classes_[i] for i, _ in tag_list[:5]]
    return tags

get_labels(predicted_list[0])  # >> ['New York']
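
As a possible alternative, here is a minimal sketch that thresholds the whole probability matrix at once and then lets inverse_transform do the mapping. The call in the question fails because inverse_transform expects a 2D matrix of 0/1 values, not a 1D row of float probabilities. The helper name proba_to_labels and the probability values are made up for illustration; in practice proba would be the output of classifier.predict_proba(X_test).

```python
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer

# Fit the binarizer on label lists like the ones in the question.
lb = MultiLabelBinarizer()
lb.fit([["new york"], ["london"], ["new york", "london"]])
# lb.classes_ is sorted alphabetically: ['london', 'new york']

# Made-up probabilities standing in for classifier.predict_proba(X_test).
# Rows are samples; columns follow the order of lb.classes_.
proba = np.array([
    [0.10, 0.80],
    [0.70, 0.20],
    [0.40, 0.45],
    [0.05, 0.30],
])

def proba_to_labels(proba, binarizer, threshold=0.35, max_labels=5):
    """Hypothetical helper: threshold the probabilities, keep at most
    max_labels labels per row, and map the 0/1 matrix back to label tuples."""
    proba = np.atleast_2d(proba)  # inverse_transform needs a 2D matrix
    indicator = (proba >= threshold).astype(int)
    if proba.shape[1] > max_labels:
        # Keep only the max_labels highest-scoring columns in each row.
        top = np.argsort(proba, axis=1)[:, -max_labels:]
        keep = np.zeros_like(indicator)
        np.put_along_axis(keep, top, 1, axis=1)
        indicator &= keep
    return binarizer.inverse_transform(indicator)

labels = proba_to_labels(proba, lb)
print(labels)
```

With these made-up values it prints [('new york',), ('london',), ('london', 'new york'), ()]; the last row is empty because no label reaches the threshold. A single row such as proba[0] also works, since the helper promotes it to 2D first.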
