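I am working on a multi-class text classification problem and need the top three predicted labels together with their probabilities. I can call sklearn's `predict_proba()`, but I am having trouble formatting the output into the layout shown in Table A. My code is as follows: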

    cv = StratifiedKFold(n_splits=10, random_state=42, shuffle=None)

    pipeline_sgd = Pipeline([
        ('vect', CountVectorizer()),
        ('tfdif', TfidfTransformer()),
        ('nb', CalibratedClassifierCV(base_estimator=SGDClassifier(), cv=cv)),
    ])

    model = pipeline_sgd.fit(X_train, y_train)

    n_top_labels = 3
    probas = model.predict_proba(test["text"])
    top_n_lables_idx = probas.argsort()[::-1][:n_top_labels]
    top_n_probs = probas[top_n_lables_idx]
    top_n_labels = label_encoder.inverse_transform(top_n_lables_idx.ravel())
    results = list(zip(top_n_labels, top_n_probs))

Output:

    [(A, .80), (B, .10), (C, .10)]

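The problem with this output is that it does not give the top three labels and probabilities for each row of text. For example, when I run inference on a new set of documents (texts), I get a single output rather than one output per document (row).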
My second problem is that when I load the results into a data frame with `pd.DataFrame(data=results)`, I get the following:

|   | 0 | 1               |
|---|---|-----------------|
| 0 | A | [[.80,.10,.10]] |
| 1 | B | [[.85,.10,.05]] |
| 2 | C | [[.70,.20,.10]] |

It should be:

|   | 0     | 1               |
|---|-------|-----------------|
| 0 | A,B,C | [[.80,.10,.10]] |
| 1 | B,C,A | [[.85,.10,.05]] |
| 2 | C,B,A | [[.70,.20,.10]] |

Table A

| Text                                      | Predicted labels | Probabilities |
|-------------------------------------------|------------------|---------------|
| Hello World!                              | A,B,C            | [.80,.10,.10] |
| Have a nice Day!                          | B,C,A            | [.90,.05,.05] |
| It's a wonderful day in the neighborhood. | C,A,B            | [.80,.10,.10] |

Answer:
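When I ran your code, `top_n_probs` came out with a very odd shape, and I had trouble mapping it back to the labels. The `argsort` call and the code that pulls out the sorted values both look a bit off: `[::-1][:n_top_labels]` reverses and slices the rows (documents) rather than the columns (classes).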
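The shape problem can be reproduced without a model. The following is a minimal sketch that assumes a made-up probability matrix (4 documents, 3 classes) in place of the real `predict_proba` output; the array values and the names `idx_rows` and `idx_cols` are illustrative only.

```python
import numpy as np

# Hypothetical probabilities: 4 documents (rows) x 3 classes (columns).
probas = np.array([
    [0.15, 0.80, 0.05],
    [0.85, 0.10, 0.05],
    [0.20, 0.10, 0.70],
    [0.25, 0.35, 0.40],
])

# Pattern from the question: argsort ranks within each row, but [::-1]
# then reverses the order of the rows and [:3] keeps three rows,
# not the three best classes.
idx_rows = probas.argsort()[::-1][:3]
print(idx_rows.shape)          # (3, 3)

# Indexing with a 2-D integer array selects whole rows of probas,
# which is where the unexpected 3-D shape comes from.
print(probas[idx_rows].shape)  # (3, 3, 3)

# Per-document ranking: negate to sort descending along the class axis,
# then slice the columns to keep the top 2 classes of every document.
idx_cols = np.argsort(-probas, axis=1)[:, :2]
top_probs = np.take_along_axis(probas, idx_cols, axis=1)
print(idx_cols)
print(top_probs)
```

Sorting along `axis=1` and slicing columns keeps one row per document, which is the shape the per-row table needs.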
Below is a quick implementation that should work.

Using an example dataset:

    from sklearn.model_selection import StratifiedKFold
    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
    from sklearn.calibration import CalibratedClassifierCV
    from sklearn.linear_model import SGDClassifier
    import pandas as pd
    import numpy as np

    df = pd.read_csv('./smsspamcollection//SMSSpamCollection',
                     sep='\t', names=["label", "message"])

    # Randomly split 'ham' into two classes to get a 3-class problem.
    # .loc avoids the chained-assignment pitfall of df['label'][mask] = ...
    is_ham = df['label'] == 'ham'
    df.loc[is_ham, 'label'] = np.random.choice(['hamA', 'hamB'], is_ham.sum())

    X_train = df['message']
    y_train = df['label']

My labels look like this:

    df['label'].value_counts()

    hamB    2425
    hamA    2400
    spam     747

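Fitting with your code (with `shuffle=True`, since `random_state` only has an effect when shuffling is enabled):

    cv = StratifiedKFold(n_splits=10, random_state=42, shuffle=True)

    pipeline_sgd = Pipeline([
        ('vect', CountVectorizer()),
        ('tfdif', TfidfTransformer()),
        # In scikit-learn >= 1.2 this parameter is named `estimator`.
        ('nb', CalibratedClassifierCV(base_estimator=SGDClassifier(), cv=cv)),
    ])

    model = pipeline_sgd.fit(X_train, y_train)
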
This should work:

    n_top_labels = 3
    probas = model.predict_proba(X_train[:5])

    # Sort each row in descending order and keep the first n_top_labels columns.
    top_n_lables_idx = np.argsort(-probas)[:, :n_top_labels]
    top_n_probs = np.round(-np.sort(-probas)[:, :n_top_labels], 3)
    top_n_labels = [model.classes_[i] for i in top_n_lables_idx]

    results = list(zip(top_n_labels, top_n_probs))
    pd.DataFrame(results)

Output:

                        0                      1
    0  [hamB, hamA, spam]   [0.608, 0.38, 0.012]
    1  [hamA, hamB, spam]  [0.605, 0.391, 0.004]
    2  [spam, hamB, hamA]  [0.603, 0.212, 0.185]
    3  [hamB, hamA, spam]  [0.521, 0.478, 0.001]
    4  [hamB, hamA, spam]  [0.645, 0.352, 0.003]

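To get closer to the Table A layout from the question (one row per text, comma-joined labels, a list of probabilities), the same ranking can be wrapped in a small helper. This is only a sketch: `top_n_table` is a hypothetical name, and the texts, class names, and probabilities below are made-up stand-ins for real model output.

```python
import numpy as np
import pandas as pd


def top_n_table(texts, probas, classes, n=3):
    """Return one row per text with its top-n labels and probabilities."""
    probas = np.asarray(probas)
    classes = np.asarray(classes)
    # Column indices of the n largest probabilities in each row, best first.
    order = np.argsort(-probas, axis=1)[:, :n]
    top_probs = np.take_along_axis(probas, order, axis=1)
    return pd.DataFrame({
        "Text": list(texts),
        "Predicted labels": [",".join(row) for row in classes[order]],
        "Probabilities": [row.round(2).tolist() for row in top_probs],
    })


# Illustrative inputs only; in practice probas would come from predict_proba.
texts = ["Hello World!", "Have a nice Day!"]
classes = ["A", "B", "C"]
probas = [[0.80, 0.15, 0.05],
          [0.03, 0.90, 0.07]]

table = top_n_table(texts, probas, classes, n=3)
print(table)
```

With the fitted pipeline above, the call would presumably look like `top_n_table(X_new, model.predict_proba(X_new), model.classes_)`, where `X_new` stands for whatever collection of new documents is being scored.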