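I'm having trouble understanding A) the output of naive Bayes and B) the predict() function for naive Bayes.
This isn't my data, but here is a fun example of what I'm trying to do and the errors I'm running into: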
require(RTextTools)
require(useful)
require(e1071)  # naiveBayes() comes from e1071; loaded explicitly in case RTextTools does not attach it

# One movie quote per row
script <- data.frame(lines = c(
  "Rufus, Brint, and Meekus were like brothers to me. And when I say brother, I don't mean, like, an actual brother, but I mean it like the way black people use it. Which is more meaningful I think",
  "If there is anything that this horrible tragedy can teach us, it's that a male model's life is a precious, precious commodity. Just because we have chiseled abs and stunning features, it doesn't mean that we too can't not die in a freak gasoline fight accident",
  "Why do you hate models, Matilda",
  "What is this? A center for ants? How can we be expected to teach children to learn how to read... if they can't even fit inside the building?",
  "Look, I think I know what this is about and I'm complimented but not interested.",
  "Hi Derek! My name's Little Cletus and I'm here to tell you a few things about child labor laws, ok? They're silly and outdated. Why back in the 30s, children as young as five could work as they pleased; from textile factories to iron smelts. Yippee! Hurray!",
  "Todd, are you not aware that I get farty and bloated with a foamy latte?",
  "Oh, I'm sorry, did my pin get in the way of your ass? Do me a favor and lose five pounds immediately or get out of my building like now!",
  "It's that damn Hansel! He's so hot right now!",
  "Obey my dog!",
  "I hear words like beauty and handsomness and incredibly chiseled features and for me that's like a vanity of self absorption that I try to steer clear of.",
  "Yeah, you're cool to hide here, but first me and him got to straighten some shit out.",
  "I wasn't like every other kid, you know, who dreams about being an astronaut, I was always more interested in what bark was made out of on a tree. Richard Gere's a real hero of mine. Sting. Sting would be another person who's a hero. The music he's created over the years, I don't really listen to it, but the fact that he's making it, I respect that. I care desperately about what I do. Do I know what product I'm selling? No. Do I know what I'm doing today? No. But I'm here, and I'm gonna give it my best shot.",
  "I totally agree with you. But how do you feel about male models?",
  "So I'm rappelling down Mount Vesuvius when suddenly I slip, and I start to fall. Just falling, ahh ahh, I'll never forget the terror. When suddenly I realize Holy shit, Hansel, haven't you been smoking Peyote for six straight days, and couldn't some of this maybe be in your head?"
))

# Class label (speaker) for each quote
people <- as.factor(c("Zoolander", "Zoolander", "Zoolander", "Zoolander", "Zoolander",
                      "Mugatu", "Mugatu", "Mugatu", "Mugatu", "Mugatu",
                      "Hansel", "Hansel", "Hansel", "Hansel", "Hansel"))

# Build the document-term matrix and fit the model
script.doc.matrix <- create_matrix(script$lines, language = "english", removeNumbers = TRUE, removeStopwords = TRUE, stemWords = FALSE)
script.matrix <- as.matrix(script.doc.matrix)

nb.script <- naiveBayes(script.matrix, people)
nb.predict <- predict(nb.script, script$lines)
nb.predict
My questions:
A) The output of naive Bayes:
When I run
nb.script$tables
I get tables like this:
$young
           young
people      [,1]      [,2]
  Hansel     0.0 0.0000000
  Mugatu     0.2 0.4472136
  Zoolander  0.0 0.0000000
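How should I interpret this data? I thought these were supposed to be probabilities, but I don't understand what the columns [,1] and [,2] mean. Also, shouldn't the probabilities shown in these tables add up to 1.0? Why don't they? A third column would make more sense to me; shouldn't there be one?
Should I be using type=raw in naiveBayes()?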
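B) The predict() function for naive Bayes:
The output predicts Hansel for every entry. I believe this is because it is the first class alphabetically. In other runs, where Hansel was listed 4 times, Mugatu 6 times, and Zoolander 5 times, predict() labeled every entry Mugatu, simply because it appears most often in the class vector.
Edit: as for my question... how can I get an actual prediction?
The output of the prediction is as follows: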
> nb.predict
 [1] Hansel Hansel Hansel Hansel Hansel Hansel Hansel Hansel Hansel Hansel Hansel
[12] Hansel Hansel Hansel Hansel
Levels: Hansel Mugatu Zoolander
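Here is a link to a similar question: R: Naives Bayes classifier bases decision only on a-priori probabilities. However, that answer didn't help me much.
Thanks in advance!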
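Answer:
On the first part of your question: the columns of your matrix script.matrix are numeric. naiveBayes interprets numeric inputs as continuous data drawn from a Gaussian distribution. The tables you see in your output give the sample mean (column 1) and standard deviation (column 2) of each numeric variable within each class of the factor. They are not probabilities, which is why the rows do not sum to 1.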
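To make that concrete, here is a minimal base-R sketch that reproduces the two numbers in the Mugatu row of the young table. The young vector is a hand-built stand-in for the corresponding column of script.matrix (an assumption: the term occurs once, in the first Mugatu line, and nowhere else), and people mirrors the class labels from the question.

```r
# Class labels in the same order as in the question.
people <- factor(rep(c("Zoolander", "Mugatu", "Hansel"), each = 5))

# Hypothetical stand-in for script.matrix[, "young"]: the term is assumed
# to occur once, in the first Mugatu line, and nowhere else.
young <- c(rep(0, 5), 1, rep(0, 4), rep(0, 5))

# Per-class sample mean and standard deviation: the two columns that
# naiveBayes() reports for a numeric predictor.
class_mean <- tapply(young, people, mean)
class_sd   <- tapply(young, people, sd)

round(cbind(mean = class_mean, sd = class_sd), 7)
```

The Mugatu row comes out as mean 0.2 and standard deviation 0.4472136, matching the [,1] and [,2] columns in the question.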
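You probably want naive Bayes to treat your input variables as indicator variables. A simple way to do that is to convert the whole of script.matrix to a character matrix:
# convert the columns to character
script.matrix <- apply(as.matrix(script.doc.matrix), 2, as.character)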
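As a small illustration of what that conversion does, the sketch below applies it to a toy count matrix (toy and its dimnames are made up for this example, not taken from the question's data). Changing the storage mode is shown as a base-R alternative that also yields a character matrix while keeping the row names.

```r
# Toy document-term matrix (hypothetical data: two documents, two terms).
toy <- matrix(c(0, 1, 1, 0), nrow = 2,
              dimnames = list(c("doc1", "doc2"), c("young", "models")))

# Same conversion as in the answer: every column becomes character.
toy.chr <- apply(toy, 2, as.character)
is.numeric(toy)        # TRUE
is.character(toy.chr)  # TRUE
colnames(toy.chr)      # the term names survive the conversion

# Alternative: change the storage mode in place, which keeps all dimnames.
toy.chr2 <- toy
storage.mode(toy.chr2) <- "character"
rownames(toy.chr2)
```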
With this change:
> nb.script <- naiveBayes(script.matrix, people)
> nb.script$tables$young
           young
people        0   1
  Hansel    1.0 0.0
  Mugatu    0.8 0.2
  Zoolander 1.0 0.0
To see the predicted classes:
> nb.predict <- predict(nb.script, script.matrix)
> nb.predict
 [1] Zoolander Zoolander Zoolander Zoolander Zoolander Mugatu    Mugatu
 [8] Mugatu    Mugatu    Mugatu    Hansel    Hansel    Hansel    Hansel
[15] Hansel
Levels: Hansel Mugatu Zoolander
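The all-Hansel output in the question fits a prediction that falls back on the class priors alone. A likely cause, stated here as an assumption about predict() and not something confirmed above, is that script$lines has no columns matching the terms the model was trained on, so no feature evidence is used. The base-R sketch below only reproduces the tie-breaking and majority behaviour the question describes; it does not call naiveBayes().

```r
# Equal class counts, as in the question: 5 lines per character.
people <- factor(rep(c("Zoolander", "Mugatu", "Hansel"), each = 5))
priors <- prop.table(table(people))
# With tied priors, which.max() returns the first level (alphabetical order).
names(which.max(priors))

# Unequal counts from the question's second scenario: 4 / 6 / 5.
people2 <- factor(rep(c("Hansel", "Mugatu", "Zoolander"), times = c(4, 6, 5)))
priors2 <- prop.table(table(people2))
# Now the most frequent class wins.
names(which.max(priors2))
```

Passing the same kind of object to predict() that was used for fitting (script.matrix here) is what lets the per-term tables contribute to the result.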
To see the raw probabilities from the naive Bayes fit:
predict(nb.script, script.matrix, type='raw')
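For orientation, here is a self-contained sketch of what type = "raw" returns, using a tiny made-up data set instead of the script matrix (toy.x, toy.y and toy.fit are hypothetical names and data). It assumes the e1071 package is installed. Each row of the result holds the posterior probability of every class for one observation, so the rows sum to 1.

```r
library(e1071)  # provides naiveBayes() and its predict() method

# Hypothetical toy data: two indicator terms stored as factors, two classes.
toy.x <- data.frame(
  young  = factor(c("0", "1", "1", "0", "0", "0"), levels = c("0", "1")),
  models = factor(c("0", "0", "1", "1", "1", "0"), levels = c("0", "1"))
)
toy.y <- factor(c("Mugatu", "Mugatu", "Mugatu",
                  "Zoolander", "Zoolander", "Zoolander"))

toy.fit <- naiveBayes(toy.x, toy.y)

# One row per observation, one column per class.
toy.raw <- predict(toy.fit, toy.x, type = "raw")
toy.raw
rowSums(toy.raw)  # each row sums to 1 (up to floating point)

# The default, type = "class", returns the most probable class per row.
predict(toy.fit, toy.x)
```

Note that type is an argument of predict(), not of naiveBayes(), which answers the type=raw part of the question.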