Using predict_proba or decision_function as estimator "confidence"

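I am using LogisticRegression in scikit-learn as the model to train an estimator. Most of my features are categorical, and so are the labels. I therefore use a DictVectorizer and a LabelEncoder, respectively, to encode the values properly.

The training part is fairly straightforward, but I am having trouble with the testing part. The simple approach is to call the trained model's "predict" method to get the predicted label. However, for later processing I need the probability of every possible label (class) for each particular instance, so I decided to use the "predict_proba" method. The problem is that I get different results for the same test instance depending on whether it is passed by itself or together with other instances.

The following code reproduces the problem.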

    from sklearn.linear_model import LogisticRegression
    from sklearn.feature_extraction import DictVectorizer
    from sklearn.preprocessing import LabelEncoder

    X_real = [{'head': u'n\xe3o', 'dep_rel': u'ADVL'},
              {'head': u'v\xe3o', 'dep_rel': u'ACC'},
              {'head': u'empresa', 'dep_rel': u'SUBJ'},
              {'head': u'era', 'dep_rel': u'ACC'},
              {'head': u't\xeam', 'dep_rel': u'ACC'},
              {'head': u'import\xe2ncia', 'dep_rel': u'PIV'},
              {'head': u'balan\xe7o', 'dep_rel': u'SUBJ'},
              {'head': u'ocupam', 'dep_rel': u'ACC'},
              {'head': u'acesso', 'dep_rel': u'PRED'},
              {'head': u'elas', 'dep_rel': u'SUBJ'},
              {'head': u'assinaram', 'dep_rel': u'ACC'},
              {'head': u'agredido', 'dep_rel': u'SUBJ'},
              {'head': u'pol\xedcia', 'dep_rel': u'ADVL'},
              {'head': u'se', 'dep_rel': u'ACC'}]

    y_real = [u'AM-NEG', u'A1', u'A0', u'A1', u'A1', u'A1', u'A0', u'A1', u'AM-ADV', u'A0', u'A1', u'A0', u'A2', u'A1']

    feat_encoder = DictVectorizer()
    feat_encoder.fit(X_real)

    label_encoder = LabelEncoder()
    label_encoder.fit(y_real)

    model = LogisticRegression()
    model.fit(feat_encoder.transform(X_real), label_encoder.transform(y_real))

    print "Test 1..."
    X_test1 = [{'head': u'governo', 'dep_rel': u'SUBJ'}]
    X_test1_encoded = feat_encoder.transform(X_test1)
    print "Features Encoded"
    print X_test1_encoded
    print "Shape"
    print X_test1_encoded.shape
    print "decision_function:"
    print model.decision_function(X_test1_encoded)
    print "predict_proba:"
    print model.predict_proba(X_test1_encoded)

    print "Test 2..."
    X_test2 = [{'head': u'governo', 'dep_rel': u'SUBJ'},
               {'head': u'atrav\xe9s', 'dep_rel': u'ADVL'},
               {'head': u'configuram', 'dep_rel': u'ACC'}]
    X_test2_encoded = feat_encoder.transform(X_test2)
    print "Features Encoded"
    print X_test2_encoded
    print "Shape"
    print X_test2_encoded.shape
    print "decision_function:"
    print model.decision_function(X_test2_encoded)
    print "predict_proba:"
    print model.predict_proba(X_test2_encoded)

    print "Test 3..."
    X_test3 = [{'head': u'governo', 'dep_rel': u'SUBJ'},
               {'head': u'atrav\xe9s', 'dep_rel': u'ADVL'},
               {'head': u'configuram', 'dep_rel': u'ACC'},
               {'head': u'configuram', 'dep_rel': u'ACC'},]
    X_test3_encoded = feat_encoder.transform(X_test3)
    print "Features Encoded"
    print X_test3_encoded
    print "Shape"
    print X_test3_encoded.shape
    print "decision_function:"
    print model.decision_function(X_test3_encoded)
    print "predict_proba:"
    print model.predict_proba(X_test3_encoded)

This is the output obtained:

    Test 1...
    Features Encoded
      (0, 4)    1.0
    Shape
    (1, 19)
    decision_function:
    [[ 0.55372615 -1.02949707 -1.75474347 -1.73324726 -1.75474347]]
    predict_proba:
    [[ 1.  1.  1.  1.  1.]]
    Test 2...
    Features Encoded
      (0, 4)    1.0
      (1, 1)    1.0
      (2, 0)    1.0
    Shape
    (3, 19)
    decision_function:
    [[ 0.55372615 -1.02949707 -1.75474347 -1.73324726 -1.75474347]
     [-1.07370197 -0.69103629 -0.89306092 -1.51402163 -0.89306092]
     [-1.55921001  1.11775556 -1.92080112 -1.90133404 -1.92080112]]
    predict_proba:
    [[ 0.59710757  0.19486904  0.26065002  0.32612646  0.26065002]
     [ 0.23950111  0.24715931  0.51348452  0.3916478   0.51348452]
     [ 0.16339132  0.55797165  0.22586546  0.28222574  0.22586546]]
    Test 3...
    Features Encoded
      (0, 4)    1.0
      (1, 1)    1.0
      (2, 0)    1.0
      (3, 0)    1.0
    Shape
    (4, 19)
    decision_function:
    [[ 0.55372615 -1.02949707 -1.75474347 -1.73324726 -1.75474347]
     [-1.07370197 -0.69103629 -0.89306092 -1.51402163 -0.89306092]
     [-1.55921001  1.11775556 -1.92080112 -1.90133404 -1.92080112]
     [-1.55921001  1.11775556 -1.92080112 -1.90133404 -1.92080112]]
    predict_proba:
    [[ 0.5132474   0.12507868  0.21262531  0.25434403  0.21262531]
     [ 0.20586462  0.15864173  0.4188751   0.30544372  0.4188751 ]
     [ 0.14044399  0.3581398   0.1842498   0.22010613  0.1842498 ]
     [ 0.14044399  0.3581398   0.1842498   0.22010613  0.1842498 ]]

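As you can see, the values that "predict_proba" returns for the instance in "X_test1" change when that same instance is passed alongside the others in "X_test2". In addition, "X_test3" simply reproduces "X_test2" and adds one more instance (identical to the last one in "X_test2"), yet the probability values change for all of the instances. Why does this happen? I also find it very odd that all the probabilities for "X_test1" are 1; shouldn't they sum to 1 instead?

Now, if I use "decision_function" instead of "predict_proba", I do get the consistent values I need. The problem is that I get negative coefficients, and even some of the positive ones are greater than 1.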
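
For context, negative values and values above 1 are expected from "decision_function": it returns unbounded per-class scores rather than probabilities. As a rough illustration only, the sketch below squashes the first row of scores from the output above into a distribution with a softmax. The helper name `softmax` is hypothetical, and this is an assumption about one possible mapping, not necessarily how LogisticRegression derives "predict_proba" internally.

```python
import math


def softmax(scores):
    """Map unbounded scores to positive values that sum to 1 (illustrative helper)."""
    # Subtract the maximum for numerical stability; the result is unchanged.
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# First row of the decision_function output shown above.
scores = [0.55372615, -1.02949707, -1.75474347, -1.73324726, -1.75474347]
probs = softmax(scores)

print(probs)       # every value lies strictly between 0 and 1
print(sum(probs))  # the values sum to 1 (up to floating-point rounding)
```

The ranking of the classes is the same before and after the mapping, so the raw scores are enough when only the order of the labels matters.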

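So, which one should I use? Why do the values from "predict_proba" change this way? Am I misunderstanding what these values mean?

Thanks in advance for any help you can give me.

Update

As suggested, I changed the code so that it also prints the encoded "X_test1", "X_test2" and "X_test3", along with their shapes. This does not appear to be the problem, since the encoding of the same instances is consistent across the test sets.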

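Answer:

As pointed out in the comments on the question, the error was caused by an implementation bug in the version of scikit-learn I was using. Updating to the latest stable version, 0.12.1, resolved the problem.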
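
A quick way to spot this kind of problem is to check that every row returned by "predict_proba" sums to 1. The sketch below is a minimal check under that assumption, run against the "Test 2" values reported in the question. The function name `rows_sum_to_one` and the tolerance are hypothetical choices, not part of scikit-learn.

```python
import math


def rows_sum_to_one(proba, tol=1e-6):
    """Return True if every row of a probability matrix sums to 1 within tol."""
    return all(math.isclose(sum(row), 1.0, abs_tol=tol) for row in proba)


# predict_proba output reported for "Test 2" in the question.
reported = [
    [0.59710757, 0.19486904, 0.26065002, 0.32612646, 0.26065002],
    [0.23950111, 0.24715931, 0.51348452, 0.3916478, 0.51348452],
    [0.16339132, 0.55797165, 0.22586546, 0.28222574, 0.22586546],
]

# A hand-made valid distribution, for comparison.
valid = [[0.5, 0.2, 0.1, 0.1, 0.1]]

print(rows_sum_to_one(reported))  # False: the first row sums to about 1.64
print(rows_sum_to_one(valid))     # True
```

The same function can be applied directly to the array returned by `model.predict_proba(...)`, since iterating over a two-dimensional NumPy array yields its rows.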
