I use the following code to make predictions for text classification:
predicted = clf.predict(X_new_tfidf)
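Each prediction says that a text snippet belongs either to topic A or to topic B. However, I would like to analyze the uncertain predictions further, that is, the cases where the model is very unsure whether the answer is A or B but has to pick one anyway. Is there a way to extract the relative confidence of each prediction?
Code: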
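X_train contains the training sentences: ["A sentence I know belongs to topic A", "Another sentence describing topic A", "A sentence about topic B", "Another sentence about topic B", ...]
Y_train contains the corresponding labels: ["Topic A", "Topic A", "Topic B", "Topic B", ...]
predict_these_X is the list of sentences I want to classify: ["Some random sentence", "Another sentence", "Yet another sentence", ...]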
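from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import BernoulliNB

# Turn the training sentences into tf-idf features
count_vect = CountVectorizer()
tfidf_transformer = TfidfTransformer()
X_train_counts = count_vect.fit_transform(X_train)
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

# Apply the already fitted transformers to the new sentences
X_new_counts = count_vect.transform(predict_these_X)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)

# Train the classifier and predict
estimator = BernoulliNB()
estimator.fit(X_train_tfidf, Y_train)
predictions = estimator.predict(X_new_tfidf)
print(estimator.predict_proba(X_new_tfidf))
return predictions  # this snippet is the body of a function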
Result:
[[ 9.97388646e-07 9.99999003e-01]
 [ 9.99996892e-01 3.10826824e-06]
 [ 9.40063326e-01 5.99366742e-02]
 [ 9.99999964e-01 3.59816546e-08]
 ...
 [ 1.95070084e-10 1.00000000e+00]
 [ 3.21721965e-15 1.00000000e+00]
 [ 1.00000000e+00 3.89012777e-10]]
Answer:
from sklearn.datasets import make_classification
from sklearn.naive_bayes import BernoulliNB

# generate some artificial data
X, y = make_classification(n_samples=1000, n_features=50, weights=[0.1, 0.9])

# your estimator
estimator = BernoulliNB()
estimator.fit(X, y)

# generate predictions
estimator.predict(X)
Out[164]: array([1, 1, 1, ..., 0, 1, 1])

# to get confidence on the prediction
estimator.predict_proba(X)
Out[163]:
array([[ 0.0043, 0.9957],
       [ 0.0046, 0.9954],
       [ 0.0071, 0.9929],
       ...,
       [ 0.8392, 0.1608],
       [ 0.0018, 0.9982],
       [ 0.0339, 0.9661]])
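As you can see, each of the first three observations has more than a 99% probability of being the positive class. The columns of predict_proba follow the order of estimator.classes_, so here the second column is the probability of class 1.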
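To pick out the uncertain predictions the question asks about, one option is to take the largest probability in each row as the confidence and flag the rows that fall below a cutoff. The following is a minimal sketch on the same artificial data, not a definitive recipe: the names THRESHOLD, confidence and uncertain_idx, the 0.75 cutoff and the random_state value are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.naive_bayes import BernoulliNB

# Artificial data as in the answer; random_state is an assumption added
# here only so that repeated runs give the same data.
X, y = make_classification(n_samples=1000, n_features=50,
                           weights=[0.1, 0.9], random_state=0)

estimator = BernoulliNB()
estimator.fit(X, y)

predicted = estimator.predict(X)
proba = estimator.predict_proba(X)   # shape (n_samples, n_classes)

# Confidence of each prediction: the largest class probability in its row.
confidence = proba.max(axis=1)

# Hypothetical cutoff: anything below it is treated as "uncertain".
THRESHOLD = 0.75
uncertain_idx = np.where(confidence < THRESHOLD)[0]
print("uncertain predictions:", len(uncertain_idx), "of", len(confidence))

# Show the five least confident predictions, least confident first.
for i in np.argsort(confidence)[:5]:
    print(i, predicted[i], proba[i])
```

With two classes the confidence always lies between 0.5 and 1, so the cutoff has to be chosen in that range. Naive Bayes probabilities tend to be pushed toward 0 and 1, as the output in the question shows, so they are probably more useful for ranking predictions by uncertainty than as literal probabilities.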