### Gaussian Naive Bayes predict_proba returns only 0 or 1

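I trained a GaussianNB model from scikit-learn. When I call classifier.predict_proba, it returns only 1 or 0 on new data. I expected it to return a confidence percentage for each prediction. I doubt the model can really be 100% confident about new data it has never seen. I have tested this on several different inputs. I use CountVectorizer and TfidfTransformer to encode the text.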

The encoding is as follows:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_extraction.text import TfidfTransformer

    count_vect = CountVectorizer()
    tfidf_transformer = TfidfTransformer()

    X_train_counts = count_vect.fit_transform(X_train_word)
    X_train = tfidf_transformer.fit_transform(X_train_counts).toarray()
    print(X_train)

    X_test_counts = count_vect.transform(X_test_word)
    X_test = tfidf_transformer.transform(X_test_counts).toarray()
    print(X_test)

The model is as follows (I get 91% accuracy):

    from sklearn.metrics import accuracy_score
    from sklearn.naive_bayes import GaussianNB

    classifier = GaussianNB()
    classifier.fit(X_train, y_train)

    # Predict class labels
    y_pred = classifier.predict(X_test)

    # Accuracy
    accuracy = accuracy_score(y_test, y_pred)
    print(accuracy)

Finally, when I use the predict_proba method:

    y_pred = classifier.predict_proba(X_test)
    print(y_pred)

I get output like this:

    [[0. 1.]
     [1. 0.]
     [0. 1.]
     ...
     [1. 0.]
     [1. 0.]
     [1. 0.]]

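It does not seem reasonable for the model to be 100% confident on new data. Besides y_test, I also tried other inputs, and the result is the same. Any help would be greatly appreciated!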

Edit in response to the comments: the output of .predict_log_proba() is even stranger:

    [[ 0.00000000e+00 -6.95947375e+09]
     [-4.83948755e+09  0.00000000e+00]
     [ 0.00000000e+00 -1.26497690e+10]
     ...
     [ 0.00000000e+00 -6.97191054e+09]
     [ 0.00000000e+00 -2.25589894e+09]
     [ 0.00000000e+00 -2.93089863e+09]]

Answer:

Let me reproduce your result on the public 20 newsgroups dataset. To keep things simple, I will use only two groups and 30 observations:

    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_extraction.text import TfidfTransformer
    from sklearn.preprocessing import FunctionTransformer
    from sklearn.naive_bayes import GaussianNB
    from sklearn.pipeline import make_pipeline

    cats = ['alt.atheism', 'sci.space']
    newsgroups_train = fetch_20newsgroups(subset='train', categories=cats)
    newsgroups_test = fetch_20newsgroups(subset='test', categories=cats)

    # Deliberately create a very small training set
    X_small, y_small = newsgroups_train['data'][:30], newsgroups_train['target'][:30]
    print(y_small)
    # [0 1 1 1 0 1 1 0 0 0 1 1 1 1 1 0 0 1 1 0 1 1 0 0 0 0 0 1 0 1]

Now let's train a model. I will use a pipeline to chain all the steps into a single estimator:

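    model = make_pipeline(
        CountVectorizer(),
        TfidfTransformer(),
        # GaussianNB needs dense input; toarray() returns a plain ndarray
        # (todense() returns np.matrix, which recent scikit-learn rejects)
        FunctionTransformer(lambda x: x.toarray(), accept_sparse=True),
        GaussianNB()
    )
    model.fit(X_small, y_small)

    print(model.predict_proba(newsgroups_test['data']))
    # [[1. 0.]
    #  [0. 1.]
    #  [1. 0.]

    # Accuracy on the training set
    print((model.predict(X_small) == y_small).mean())
    # 1.0

    # Accuracy on the test set
    print((model.predict(newsgroups_test['data']) == newsgroups_test['target']).mean())
    # 0.847124824684432

    # Average predicted probability of the predicted class
    print(model.predict_proba(newsgroups_test['data']).max(axis=1).mean())
    # 0.9994305488454233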

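In fact, not all of the predicted probabilities are exactly 0 or 1, but most of them are. The average predicted probability of the predicted class is 99.94%, so on average the model is extremely confident in its predictions.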
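To check how saturated a probability matrix is, a small helper along these lines may be useful. It is only a sketch: `confidence_summary`, its `threshold` parameter, and the `example` array are hypothetical names and values introduced here for illustration, not part of scikit-learn or of the original question.

```python
import numpy as np


def confidence_summary(proba, threshold=0.999):
    """Return (mean top-class probability, share of rows at or above threshold).

    `proba` is an array shaped like the output of predict_proba:
    one row per sample, one column per class.
    """
    proba = np.asarray(proba, dtype=float)
    top = proba.max(axis=1)  # probability assigned to the predicted class
    return top.mean(), (top >= threshold).mean()


# Hypothetical probabilities, made up for illustration only.
example = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.6, 0.4]])
mean_conf, saturated = confidence_summary(example)
print(mean_conf, saturated)
```

Passing the output of `model.predict_proba(...)` instead of `example` would give the same kind of summary for a fitted model.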

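We see that accuracy on the training set is perfect, but accuracy on the test set is only 84.7%. So it looks like our GaussianNB model is overfitting, that is, it relies too heavily on the training dataset. Yes, this can happen even with an algorithm as simple as NB if the feature space is large. With CountVectorizer, every word in the vocabulary is a separate feature, and the number of possible words is quite large. The model therefore overfits, which is why it produces overconfident predictions made up of zeros and ones.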
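One way to see where the hard 0 and 1 values come from is to look at a single feature. GaussianNB scores each class with a Gaussian log-density per feature, so a feature whose within-class variance is close to zero yields an enormous negative log-likelihood as soon as a test value differs from the class mean. The sketch below uses made-up numbers (a TF-IDF weight of 0.1, a variance floor of 1e-9, equal class priors); they are assumptions chosen for illustration, not values taken from the data above.

```python
import numpy as np


def gaussian_log_density(x, mean, var):
    """Log of the normal density N(x; mean, var)."""
    return -0.5 * np.log(2 * np.pi * var) - 0.5 * (x - mean) ** 2 / var


x = 0.1  # hypothetical TF-IDF weight of one word in a test document

log_like = np.array([
    # Class 0 never saw the word: mean 0 and a near-zero variance.
    gaussian_log_density(x, mean=0.0, var=1e-9),
    # Class 1 saw the word with some spread.
    gaussian_log_density(x, mean=0.1, var=1e-2),
])

# Normalize in log space, assuming equal class priors.
log_post = log_like - np.logaddexp(log_like[0], log_like[1])
posterior = np.exp(log_post)

print(log_post)   # class 0 is on the order of -5e6
print(posterior)  # underflows to exactly 0 and 1
```

With thousands of word features, many such terms are summed, which is consistent with log-probabilities in the billions like those shown in the question.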

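In general, we can fight overfitting with regularization. For GaussianNB, the simplest way to regularize the model is to set the var_smoothing parameter to a relatively large positive value (the default is 1e-9). In my experience, values between 0.01 and 1 work well. Here I set it to 0.3. This means that 30% of the largest feature variance in the training data is added to the variance of every feature.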

    model2 = make_pipeline(
        CountVectorizer(),
        TfidfTransformer(),
        # GaussianNB needs dense input; toarray() returns a plain ndarray
        FunctionTransformer(lambda x: x.toarray(), accept_sparse=True),
        GaussianNB(var_smoothing=0.3)
    )
    model2.fit(X_small, y_small)

    print(model2.predict_proba(newsgroups_test['data']))
    # [[1.00000000e+00 6.95414544e-11]
    #  [2.55262953e-02 9.74473705e-01]
    #  [9.97333826e-01 2.66617361e-03]

    # Accuracy on the training set
    print((model2.predict(X_small) == y_small).mean())
    # 1.0

    # Accuracy on the test set
    print((model2.predict(newsgroups_test['data']) == newsgroups_test['target']).mean())
    # 0.8821879382889201

    # Average predicted probability of the predicted class
    print(model2.predict_proba(newsgroups_test['data']).max(axis=1).mean())
    # 0.9657781853646639

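We can see that with regularization the model's predictions become less confident: the average confidence is 96.58% instead of 99.94%. Accuracy on the test set also improves, from 84.7% to 88.2%, because the overconfidence was causing the model to make some wrong predictions.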

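The logic behind these wrong predictions can be illustrated as follows. Without regularization, the model relies entirely on the word frequencies in the training set. For example, when it sees the text "the probability of dying from X-rays", the model thinks: "I have only seen the word 'dying' in texts about atheism, so this must be a text about atheism." But the text is actually about space. A more regularized model would not be so sure of its conclusion and would keep a small but nonzero probability that a text containing the word "dying" is about some topic other than atheism.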

So the lesson here is: whatever learning algorithm you use, find out how to regularize it, and tune the regularization parameters carefully.
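Tuning var_smoothing could be done with a cross-validated grid search, for instance. The following is a minimal sketch under stated assumptions: it uses synthetic data from make_classification rather than text, a logarithmic grid from 1e-9 to 1, and log loss as the score because it penalizes overconfident probabilities. None of these choices come from the original answer.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data: many features, few of them informative.
X, y = make_classification(
    n_samples=200, n_features=50, n_informative=5, random_state=0
)

# Candidate values spanning the default (1e-9) up to 1.
param_grid = {"var_smoothing": np.logspace(-9, 0, 10)}

search = GridSearchCV(GaussianNB(), param_grid, scoring="neg_log_loss", cv=5)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```

When the classifier sits inside a pipeline built with make_pipeline, as in the answer, the grid key would be `gaussiannb__var_smoothing`, since make_pipeline names each step after its lowercased class name.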
