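Top three classes from predicted probabilities

I am working on a multi-class text classification problem and need the top three predicted labels together with their probabilities. I can call sklearn's predict_proba(), but I am having trouble formatting the output as shown in Table A. My code is below: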

    cv = StratifiedKFold(n_splits = 10, random_state = 42, shuffle = None)
    pipeline_sgd = Pipeline([
        ('vect', CountVectorizer()),
        ('tfdif', TfidfTransformer()),
        ('nb', CalibratedClassifierCV(base_estimator = SGDClassifier(), cv=cv)),
    ])
    model = pipeline_sgd.fit(X_train, y_train)

    n_top_labels = 3
    probas = model.predict_proba(test["text"])
    top_n_lables_idx = probas.argsort()[::-1][:n_top_labels]
    top_n_probs = probas[top_n_lables_idx]
    top_n_labels = label_encoder.inverse_transform(top_n_lables_idx.ravel())
    results = list(zip(top_n_labels, top_n_probs))

Output:

[(A, .80), (B, .10), (C, .10)]

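The problem with the output above is that it does not give the top three labels and probabilities for each row of text. For example, when I run inference on a new set of documents (texts), I get a single output rather than one output per document (row).

My second problem is that when I put the results into a DataFrame with pd.DataFrame(data = results), I get the following: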

|   | 0 | 1               |
|---|---|-----------------|
| 0 | A | [[.80,.10,.10]] |
| 1 | B | [[.85,.10,.05]] |
| 2 | C | [[.70,.20,.10]] |

It should be:

|   | 0     | 1               |
|---|-------|-----------------|
| 0 | A,B,C | [[.80,.10,.10]] |
| 1 | B,C,A | [[.85,.10,.05]] |
| 2 | C,B,A | [[.70,.20,.10]] |

Table A

| Text                                       | Predicted labels | Probabilities   |
|--------------------------------------------|------------------|-----------------|
| Hello  World!                              | A,B,C            | [.80,.10,.10]   |
| Have a nice Day!                           | B,C,A            | [.90,.05,.05]   |
| It's a wonderful day in the neighborhood.  | C,A,B            | [.80,.10,.10]   |

Answer:

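When I ran your code, top_n_probs came out with a very odd shape, and I had trouble mapping it back to labels. The way argsort is called, and the code that indexes with the sorted values, look a bit off.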
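To make the shape problem concrete, here is a minimal sketch on a made-up probability matrix (the values, and the choice of two top labels, are assumptions for illustration, not output from the model in the question). It contrasts the slicing pattern from the question with a per-row sort:

```python
import numpy as np

# Hypothetical probabilities: 4 documents (rows) x 3 classes (columns).
probas = np.array([
    [0.80, 0.15, 0.05],
    [0.04, 0.90, 0.06],
    [0.10, 0.20, 0.70],
    [0.30, 0.60, 0.10],
])
n_top_labels = 2

# Pattern from the question: argsort() sorts within each row (ascending),
# but [::-1] then reverses the order of the ROWS and [:n] keeps the first
# n ROWS, so this selects documents rather than classes.
wrong_idx = probas.argsort()[::-1][:n_top_labels]
wrong_probs = probas[wrong_idx]  # fancy indexing on axis 0 adds a dimension

# Per-row alternative: sort each row by descending probability and keep
# the first n columns.
top_idx = np.argsort(-probas, axis=1)[:, :n_top_labels]
top_probs = np.take_along_axis(probas, top_idx, axis=1)

print(wrong_idx.shape, wrong_probs.shape)  # (2, 3) (2, 3, 3)
print(top_idx.shape, top_probs.shape)      # (4, 2) (4, 2)
```

The per-row version keeps one row per document, which is what lets the indices be mapped back to class names afterwards.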

Below is a quick implementation that should work.

Using a sample dataset:

    from sklearn.model_selection import StratifiedKFold
    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
    from sklearn.calibration import CalibratedClassifierCV
    from sklearn.linear_model import SGDClassifier
    import pandas as pd
    import numpy as np

    df = pd.read_csv('./smsspamcollection//SMSSpamCollection', sep='\t', names=["label", "message"])
    # Randomly split 'ham' into two labels to get a three-class problem
    ham = df['label'] == 'ham'
    df.loc[ham, 'label'] = np.random.choice(['hamA', 'hamB'], ham.sum())
    X_train = df['message']
    y_train = df['label']

My labels look like this:

    df['label'].value_counts()

    hamB    2425
    hamA    2400
    spam     747

Running your code to fit the model:

    cv = StratifiedKFold(n_splits = 10, random_state = 42, shuffle = True)
    pipeline_sgd = Pipeline([
        ('vect', CountVectorizer()),
        ('tfdif', TfidfTransformer()),
        # The estimator is passed positionally: the keyword is base_estimator
        # in older scikit-learn releases and estimator from 1.2 on.
        ('nb', CalibratedClassifierCV(SGDClassifier(), cv=cv)),
    ])
    model = pipeline_sgd.fit(X_train, y_train)

This should work:

    n_top_labels = 3
    probas = model.predict_proba(X_train[:5])
    # Sort each row by descending probability and keep the first n_top_labels columns
    top_n_lables_idx = np.argsort(-probas)[:, :n_top_labels]
    top_n_probs = np.round(-np.sort(-probas), 3)[:, :n_top_labels]
    # Map column indices back to class names
    top_n_labels = [model.classes_[i] for i in top_n_lables_idx]
    results = list(zip(top_n_labels, top_n_probs))
    pd.DataFrame(results)

Output:

                        0                      1
    0  [hamB, hamA, spam]   [0.608, 0.38, 0.012]
    1  [hamA, hamB, spam]  [0.605, 0.391, 0.004]
    2  [spam, hamB, hamA]  [0.603, 0.212, 0.185]
    3  [hamB, hamA, spam]  [0.521, 0.478, 0.001]
    4  [hamB, hamA, spam]  [0.645, 0.352, 0.003]
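To get closer to the layout of Table A, the per-row results can be joined to the input texts. The following is a rough sketch rather than a definitive implementation: `top_n_table` is a hypothetical helper name, and the texts, class names, and probabilities are made-up stand-ins for `test["text"]`, `model.classes_`, and `model.predict_proba(...)`.

```python
import numpy as np
import pandas as pd


def top_n_table(texts, probas, classes, n=3):
    """Return one row per text with its n most probable labels.

    Hypothetical helper: `probas` is an (n_texts, n_classes) array whose
    columns follow the order of `classes`.
    """
    probas = np.asarray(probas)
    classes = np.asarray(classes)
    # Column indices of the n largest probabilities in each row, descending.
    idx = np.argsort(-probas, axis=1)[:, :n]
    top_probs = np.take_along_axis(probas, idx, axis=1)
    top_labels = classes[idx]  # same shape as idx, holding class names
    return pd.DataFrame({
        "Text": list(texts),
        "Predicted labels": [",".join(map(str, row)) for row in top_labels],
        "Probabilities": [np.round(row, 3).tolist() for row in top_probs],
    })


# Made-up inputs standing in for real model output.
texts = ["Hello World!", "Have a nice Day!"]
classes = ["A", "B", "C"]
probas = [[0.80, 0.15, 0.05], [0.04, 0.06, 0.90]]

table = top_n_table(texts, probas, classes, n=3)
print(table)
```

With the fitted pipeline from above, a call along the lines of `top_n_table(test["text"], model.predict_proba(test["text"]), model.classes_)` should give one row per document, assuming `test["text"]` holds the new documents.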
