Implementing a Naive Bayes classifier from scratch in Python?

I have written a simple Naive Bayes classifier for my toy dataset:

                 msg  spam
0  free home service     1
1      get free data     1
2  we live in a home     0
3    i drive the car     0

Full code:

import pandas as pd
from collections import Counter

data = {'msg':['free home service','get free data','we live in a home','i drive the car'],'spam':[1,1,0,0]}
df = pd.DataFrame(data=data)
print(df)

def word_counter(word_list):
    words = []
    for x in word_list:
        for i in x:
            words.append(i)
        word_count = Counter(words)
    return word_count

spam = [x.split() for x in set(df['msg'][df['spam']==1])]
spam = word_counter(spam)
ham = [x.split() for x in set(df['msg'][df['spam']==0])]
ham = word_counter(ham)
total = len(spam.keys())+len(ham.keys())

# Prior
spam_prior = len(df['spam'][df['spam']==1])/len(df)
ham_prior = len(df['spam'][df['spam']==0])/len(df)

new_data = ["get free home service","i live in car"]
print("\n\tSpamminess")
for msg in new_data:
    data = msg.split()

    # Likelihood
    spam_likelihood = 0.001 # low value to prevent division errors
    ham_likelihood = 0.001
    for i in data:
        if i in spam:
            if spam_likelihood==0.001:
                spam_likelihood = spam[i]/total
                continue
            spam_likelihood = spam[i]/total * spam_likelihood
        if i in ham:
            if ham_likelihood==0.001:
                ham_likelihood = ham[i]/total
                continue
            ham_likelihood = ham[i]/total * ham_likelihood

    # marginal likelihood
    marginal = (spam_likelihood*spam_prior) + (ham_likelihood*ham_prior)

    spam_posterior = (spam_likelihood*spam_prior)/marginal
    print(msg,round(spam_posterior*100,2))

The problem is that it fails completely at classifying spamminess for unseen data.

get free home service 0.07
i live in car 97.46

I expected get free home service to score high and i live in car to score low.

My question is: is this failure due to a lack of additional data, or due to a mistake in my code?


Answer:

The problem is in the code. The likelihoods are computed incorrectly. See Wikipedia:Naive_Bayes_classifier for the correct formula for the likelihoods under the bag-of-words model.

The way your code works, the likelihood p(word | spam) comes out as 1 if a word has not been seen in spam before. With Laplace smoothing it should instead be 1 / (spam_total + 1), where spam_total is the total number of words in spam (with repetition).

When a word has already been seen x times in spam, it should be (x + 1) / (spam_total + 1).
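As a minimal sketch of that formula, here are the per-word smoothed likelihoods for the spam half of the toy dataset (the counts below come from the two spam messages, so spam_total = 6; the helper function is mine, not part of the original code):

```python
# Word counts over the two spam messages:
# 'free home service' and 'get free data'
spam_counts = {'free': 2, 'home': 1, 'service': 1, 'get': 1, 'data': 1}
spam_total = 6  # total number of spam words, repeats included

def smoothed_likelihood(word, counts, total):
    # Laplace smoothing: an unseen word gets 1/(total+1) instead of 0 (or 1)
    return (counts.get(word, 0) + 1) / (total + 1)

print(smoothed_likelihood('free', spam_counts, spam_total))  # 3/7
print(smoothed_likelihood('car', spam_counts, spam_total))   # 1/7, unseen word
```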

I changed the Counter to a defaultdict to handle previously unseen words more conveniently, fixed the likelihood calculation, and added Laplace smoothing:

import pandas as pd
from collections import defaultdict

data = {'msg':['free home service','get free data','we live in a home','i drive the car'],'spam':[1,1,0,0]}
df = pd.DataFrame(data=data)
print(df)

def word_counter(sentence_list):
    word_count = defaultdict(lambda:0)
    for sentence in sentence_list:
        for word in sentence:
            word_count[word] += 1
    return word_count

spam = [x.split() for x in set(df['msg'][df['spam']==1])]
spam_total = sum([len(sentence) for sentence in spam])
spam = word_counter(spam)
ham = [x.split() for x in set(df['msg'][df['spam']==0])]
ham_total = sum([len(sentence) for sentence in ham])
ham = word_counter(ham)

# Prior
spam_prior = len(df['spam'][df['spam']==1])/len(df)
ham_prior = len(df['spam'][df['spam']==0])/len(df)

new_data = ["get free home service","i live in car"]
print("\n\tSpamminess")
for msg in new_data:
    data = msg.split()

    # Likelihood
    spam_likelihood = 1
    ham_likelihood = 1
    for word in data:
        spam_likelihood *= (spam[word] + 1) / (spam_total + 1)
        ham_likelihood *= (ham[word] + 1) / (ham_total + 1)

    # marginal likelihood
    marginal = (spam_likelihood * spam_prior) + (ham_likelihood * ham_prior)

    spam_posterior = (spam_likelihood * spam_prior) / marginal
    print(msg,round(spam_posterior*100,2))

Now the results are as expected:

    Spamminess
get free home service 98.04
i live in car 20.65

This can be improved further; for numerical stability, for example, the multiplication of all these probabilities should be replaced by adding their logarithms.
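A sketch of that log-space idea, assuming hypothetical word counts (the helper below is illustrative, not part of the corrected code):

```python
import math

def log_likelihood(words, counts, total):
    # Sum of log-probabilities: equals the log of the product,
    # but avoids floating-point underflow for long messages.
    return sum(math.log((counts.get(w, 0) + 1) / (total + 1)) for w in words)

# Hypothetical counts for illustration
spam_counts = {'free': 2, 'home': 1, 'service': 1, 'get': 1, 'data': 1}
spam_total = 6

words = "get free home service".split()
log_p = log_likelihood(words, spam_counts, spam_total)
p = math.exp(log_p)  # recover the plain product only if it is safe to do so
print(log_p, p)
```

To normalize the posteriors at the end, the marginal would then be computed with a log-sum-exp over the per-class log scores rather than a plain sum.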
