Naive Bayes classifier: empty vocabulary

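I am trying to use Naive Bayes to detect humor in text. I took this code from here, but I am getting some errors, and since I am fairly new to machine learning and these algorithms I don't know how to fix them. My training data consists of one-line jokes. I know other people have asked the same question, but I have not found an answer yet.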

import os
import io
from pandas import DataFrame
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def readFiles(path):
    for root, dirnames, filenames in os.walk(path):
        for filename in filenames:
            path = os.path.join(root, filename)
            inBody = False
            lines = []
            f = io.open(path, 'r', encoding='latin1')
            for line in f:
                if inBody:
                    lines.append(line)
                elif line == '\n':
                    inBody = True
            f.close()
            message = '\n'.join(lines)
            yield path, message

def dataFrameFromDirectory(path, classification):
    rows = []
    index = []
    for filename, message in readFiles(path):
        rows.append({'message': message, 'class': classification})
        index.append(filename)
    return DataFrame(rows, index=index)

data = DataFrame({'message': [], 'class': []})
data = data.append(dataFrameFromDirectory('G:/PyCharmProjects/naive_bayes_classifier/train_jokes', 'funny'))
data = data.append(dataFrameFromDirectory('G:/PyCharmProjects/naive_bayes_classifier/train_non_jokes', 'notfunny'))

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(data['message'].values)

classifier = MultinomialNB()
targets = data['class'].values
classifier.fit(counts, targets)

examples = ['Where do steers go to dance?  The Meat Ball', 'tomorrow I press this button']
examples_counts = vectorizer.transform(examples)
predictions = classifier.predict(examples_counts)
print(predictions)

Here is the error message:

Traceback (most recent call last):
  File "G:/PyCharmProjects/naive_bayes_classifier/NaiveBayesClassifier.py", line 55, in <module>
    counts = vectorizer.fit_transform(data['message'].values)
  File "C:\Users\mr_wizard\AppData\Local\Programs\Python\Python36\lib\site-packages\sklearn\feature_extraction\text.py", line 869, in fit_transform
    self.fixed_vocabulary_)
  File "C:\Users\mr_wizard\AppData\Local\Programs\Python\Python36\lib\site-packages\sklearn\feature_extraction\text.py", line 811, in _count_vocab
    raise ValueError("empty vocabulary; perhaps the documents only"
ValueError: empty vocabulary; perhaps the documents only contain stop words

Here is some of the input from train_jokes:

"[me narrating a documentary about narrators] ""I can't hear what they're saying cuz I'm talking"""
"Telling my daughter garlic is good for you. Good immune system and keeps pests away.Ticks, mosquitos, vampires... men."
I've been going through a really rough period at work this week It's my own fault for swapping my tampax for sand paper.
"If I could have dinner with anyone, dead or alive... ...I would choose alive. -B.J. Novak-"
Two guys walk into a bar. The third guy ducks.
Why can't Barbie get pregnant? Because Ken comes in a different box. Heyooooooo
Why was the musician arrested? He got in treble.
Did you hear about the guy who blew his entire lottery winnings on a limousine? He had nothing left to chauffeur it.
What do you do if a bird shits on your car? Don't ask her out again.
He was a real gentlemen and always opened the fridge door for me

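train_jokes contains about 250,000 one-line jokes or tweets, while train_non_jokes contains some plain, non-humorous sentences. For now I have not prepared a proper non-humorous file; I only have some sentences from Twitter.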


Answer:

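The problem is not in the code but in the training data. First, G:/PyCharmProjects/naive_bayes_classifier/train_jokes and G:/PyCharmProjects/naive_bayes_classifier/train_non_jokes must be paths to directories that contain the training data files (so train_jokes and train_non_jokes are directories, not files). Second, my files contained no blank line (a line consisting only of '\n'), so the variable inBody always stayed False. As a result no line was ever appended, every message came out empty, and CountVectorizer had nothing to build a vocabulary from. For the program to work as written, the training data needs to look like this: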

text here and then blank line

another text
and this is it

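(I simply removed the references to inBody, which solved the blank-line problem.) These are details I overlooked while watching that video, because the presenter did not mention them. Thanks to everyone for the answers, they helped me a lot.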
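For anyone hitting the same error, the effect of the inBody flag can be reproduced without any files on disk. The following is only a minimal sketch, not the code from the tutorial: read_body and its skip_header parameter are hypothetical names introduced here, and io.StringIO stands in for the real training files. It also joins the lines with '' instead of '\n', on the assumption that doubled newlines are not wanted, since each line already ends with one.

```python
import io


def read_body(stream, skip_header=True):
    """Return the message text from an open text stream.

    skip_header=True mimics the question's readFiles(): lines are collected
    only after the first blank line. skip_header=False keeps every line,
    which corresponds to removing the inBody logic.
    """
    in_body = not skip_header
    lines = []
    for line in stream:
        if in_body:
            lines.append(line)
        elif line == '\n':
            # First blank line found: everything after it is the body.
            in_body = True
    return ''.join(lines)


# Hypothetical file contents: one-liners with no blank line anywhere.
one_liners = (
    "Two guys walk into a bar. The third guy ducks.\n"
    "Why was the musician arrested? He got in treble.\n"
)
# The layout the original code expects: a header, a blank line, then the body.
with_header = "text here and then blank line\n\nanother text\nand this is it\n"

print(repr(read_body(io.StringIO(one_liners))))                     # header logic on
print(repr(read_body(io.StringIO(one_liners), skip_header=False)))  # header logic off
print(repr(read_body(io.StringIO(with_header))))
```

With the header logic on, the one-liner file produces an empty string because the blank line is never found. When every message is empty like this, CountVectorizer finds no tokens at all, which is what leads to the "empty vocabulary" ValueError.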
