I wrote a simple Naive Bayes classifier for my toy dataset:
                 msg  spam
0  free home service     1
1      get free data     1
2  we live in a home     0
3    i drive the car     0
Full code:
import pandas as pd
from collections import Counter

data = {'msg':['free home service','get free data','we live in a home','i drive the car'],'spam':[1,1,0,0]}
df = pd.DataFrame(data=data)
print(df)

def word_counter(word_list):
    words = []
    for x in word_list:
        for i in x:
            words.append(i)
    word_count = Counter(words)
    return word_count

spam = [x.split() for x in set(df['msg'][df['spam']==1])]
spam = word_counter(spam)
ham = [x.split() for x in set(df['msg'][df['spam']==0])]
ham = word_counter(ham)

total = len(spam.keys())+len(ham.keys())

# Prior
spam_prior = len(df['spam'][df['spam']==1])/len(df)
ham_prior = len(df['spam'][df['spam']==0])/len(df)

new_data = ["get free home service","i live in car"]
print("\n\tSpamminess")
for msg in new_data:
    data = msg.split()
    # Likelihood
    spam_likelihood = 0.001  # low value to prevent division errors
    ham_likelihood = 0.001
    for i in data:
        if i in spam:
            if spam_likelihood==0.001:
                spam_likelihood = spam[i]/total
                continue
            spam_likelihood = spam[i]/total * spam_likelihood
        if i in ham:
            if ham_likelihood==0.001:
                ham_likelihood = ham[i]/total
                continue
            ham_likelihood = ham[i]/total * ham_likelihood
    # marginal likelihood
    marginal = (spam_likelihood*spam_prior) + (ham_likelihood*ham_prior)
    spam_posterior = (spam_likelihood*spam_prior)/marginal
    print(msg,round(spam_posterior*100,2))
The problem is that it fails completely on unseen data in my Spamminess classification:
get free home service 0.07
i live in car 97.46
I expected get free home service to have a high value and i live in car a low value.
My question is: is this failure due to the lack of additional data, or to a mistake in my code?
Answer:
The problem is in the code. The likelihoods are computed incorrectly. See Wikipedia: Naive_Bayes_classifier for the formula that computes the likelihoods correctly under the bag-of-words model.
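For reference, under that model the posterior for a message with words w_1, ..., w_n is

p(spam | w_1, ..., w_n) = p(spam) * p(w_1 | spam) * ... * p(w_n | spam) / p(w_1, ..., w_n)

where each per-word likelihood p(w_i | spam) is estimated from word frequencies in the spam training messages, and the denominator is the same product summed over both classes.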
The way your code works, the likelihood p(word | spam) is 1 for a word that was never encountered in spam before. With Laplace smoothing it should instead be 1 / (spam_total + 1), where spam_total is the total number of words in spam (with repetitions).
For a word encountered x times in spam before, it should be (x + 1) / (spam_total + 1).
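To make that concrete, here is the smoothed likelihood worked out by hand on the spam half of your toy dataset ("free home service" and "get free data", so spam_total = 6); this is just an illustrative sketch, not part of the fix below:

# Spam corpus: "free home service", "get free data" -> 6 words in total.
spam_counts = {'free': 2, 'home': 1, 'service': 1, 'get': 1, 'data': 1}
spam_total = 6

p_free = (spam_counts.get('free', 0) + 1) / (spam_total + 1)  # (2+1)/(6+1) ~ 0.43
p_car  = (spam_counts.get('car', 0) + 1) / (spam_total + 1)   # (0+1)/(6+1) ~ 0.14, unseen word
print(p_free, p_car)

Note that the unseen word "car" gets a small but nonzero likelihood instead of silently being skipped.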
I've changed the Counter to a defaultdict to handle previously unseen words more conveniently, fixed the likelihood computation, and added Laplace smoothing:
import pandas as pd
from collections import defaultdict

data = {'msg':['free home service','get free data','we live in a home','i drive the car'],'spam':[1,1,0,0]}
df = pd.DataFrame(data=data)
print(df)

def word_counter(sentence_list):
    word_count = defaultdict(lambda: 0)
    for sentence in sentence_list:
        for word in sentence:
            word_count[word] += 1
    return word_count

spam = [x.split() for x in set(df['msg'][df['spam']==1])]
spam_total = sum([len(sentence) for sentence in spam])
spam = word_counter(spam)
ham = [x.split() for x in set(df['msg'][df['spam']==0])]
ham_total = sum([len(sentence) for sentence in ham])
ham = word_counter(ham)

# Prior
spam_prior = len(df['spam'][df['spam']==1])/len(df)
ham_prior = len(df['spam'][df['spam']==0])/len(df)

new_data = ["get free home service","i live in car"]
print("\n\tSpamminess")
for msg in new_data:
    data = msg.split()
    # Likelihood
    spam_likelihood = 1
    ham_likelihood = 1
    for word in data:
        spam_likelihood *= (spam[word] + 1) / (spam_total + 1)
        ham_likelihood *= (ham[word] + 1) / (ham_total + 1)
    # marginal likelihood
    marginal = (spam_likelihood * spam_prior) + (ham_likelihood * ham_prior)
    spam_posterior = (spam_likelihood * spam_prior) / marginal
    print(msg, round(spam_posterior*100, 2))
Now the results are as expected:
    Spamminess
get free home service 98.04
i live in car 20.65
This can still be improved further; for example, for numerical stability, the multiplication of all these probabilities should be replaced by the addition of their logarithms.
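As a minimal sketch of what that would look like, here is a drop-in replacement for the final loop, assuming the same spam, ham, spam_total, ham_total, spam_prior, ham_prior, and new_data as defined in the corrected code above:

import math

# Sum log-probabilities instead of multiplying raw probabilities.
for msg in new_data:
    log_spam = math.log(spam_prior)
    log_ham = math.log(ham_prior)
    for word in msg.split():
        log_spam += math.log((spam[word] + 1) / (spam_total + 1))
        log_ham += math.log((ham[word] + 1) / (ham_total + 1))
    # Normalize in log space (log-sum-exp) and only exponentiate at the end.
    m = max(log_spam, log_ham)
    log_marginal = m + math.log(math.exp(log_spam - m) + math.exp(log_ham - m))
    spam_posterior = math.exp(log_spam - log_marginal)
    print(msg, round(spam_posterior * 100, 2))

On this toy dataset it prints the same numbers as before; the difference only matters for longer messages, where the product of many small probabilities would underflow to zero.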