Python – NLTK train/test split

I have been following SentDex's video series on NLTK and Python and wrote a script that uses various models (e.g. logistic regression) to determine the sentiment of reviews. My concern is that SentDex's approach includes the test set when determining which words to use for training, which is obviously not ideal (the train/test split should happen before feature selection).

(Edited following Mohammed Kashif's comment)

Full code:

import nltk
import numpy as np
from nltk.classify.scikitlearn import SklearnClassifier
from nltk.classify import ClassifierI
from nltk.corpus import movie_reviews
from sklearn.naive_bayes import MultinomialNB

documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

all_words = []
for w in movie_reviews.words():
    all_words.append(w.lower())

all_words = nltk.FreqDist(all_words)
word_features = list(all_words.keys())[:3000]

def find_features(documents):
    words = set(documents)
    features = {}
    for w in word_features:
        features[w] = (w in words)
    return features

featuresets = [(find_features(rev), category) for (rev, category) in documents]
np.random.shuffle(featuresets)
training_set = featuresets[:1800]
testing_set = featuresets[1800:]

MNB_classifier = SklearnClassifier(MultinomialNB())
MNB_classifier.train(training_set)
print("MNB_classifier accuracy:", (nltk.classify.accuracy(MNB_classifier, testing_set)) * 100)
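A side note on the word_features line above: list(all_words.keys())[:3000] takes the first 3000 keys in the FreqDist's insertion order, not necessarily the 3000 most frequent words. If the intent is a frequency-ranked vocabulary (an assumption about what the tutorial meant), FreqDist.most_common gives that directly:

# Assumption: the intended vocabulary is the 3000 most frequent words.
# nltk.FreqDist subclasses collections.Counter, so most_common(n)
# returns (word, count) pairs sorted by descending count.
word_features = [w for w, count in all_words.most_common(3000)]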

What I have tried:

documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

np.random.shuffle(documents)
training_set = documents[:1800]
testing_set = documents[1800:]

all_words = []
for w in documents.words():
    all_words.append(w.lower())

all_words = nltk.FreqDist(all_words)
word_features = list(all_words.keys())[:3000]

def find_features(training_set):
    words = set(training_set)
    features = {}
    for w in word_features:
        features[w] = (w in words)
    return features

featuresets = [(find_features(rev), category) for (rev, category) in training_set]
np.random.shuffle(featuresets)
training_set = featuresets
testing_set = testing_set

MNB_classifier = SklearnClassifier(MultinomialNB())
MNB_classifier.train(training_set)
print("MNB_classifier accuracy:", (nltk.classify.accuracy(MNB_classifier, testing_set)) * 100)

Which results in the error:

Traceback (most recent call last):
  File "", line 34, in
    print("MNB_classifier accuracy:", (nltk.classify.accuracy(MNB_classifier, testing_set)) * 100)
  File "C:\ProgramData\Anaconda3\lib\site-packages\nltk\classify\util.py", line 87, in accuracy
    results = classifier.classify_many([fs for (fs, l) in gold])
  File "C:\ProgramData\Anaconda3\lib\site-packages\nltk\classify\scikitlearn.py", line 85, in classify_many
    X = self._vectorizer.transform(featuresets)
  File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\feature_extraction\dict_vectorizer.py", line 291, in transform
    return self._transform(X, fitting=False)
  File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\feature_extraction\dict_vectorizer.py", line 166, in _transform
    for f, v in six.iteritems(x):
  File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\externals\six.py", line 439, in iteritems
    return iter(getattr(d, _iteritems)(**kw))
AttributeError: 'list' object has no attribute 'items'


Answer:

Okay, there are a few errors in the code. Let's go through them one by one.

First, your documents list is a list of tuples, which has no words() method. To access all the words, change the for loop like this:

all_words = []
for words_list, categ in documents:   # <-- each words_list is a list of words
    for w in words_list:              # <-- then access each word in the list
        all_words.append(w.lower())
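Equivalently (just a stylistic alternative), the same flattening can be written as a single generator expression fed straight into FreqDist, which accepts any iterable of items:

# Same result as the loop above, in one step.
all_words = nltk.FreqDist(w.lower() for words_list, categ in documents
                          for w in words_list)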

Second, you need to create feature sets for both the training and the test set. You created feature sets only for the training_set. Change the code to the following:

featuresets = [(find_features(rev), category) for (rev, category) in documents]

np.random.shuffle(featuresets)
training_set = featuresets[:1800]
testing_set = featuresets[1800:]
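This is also what the traceback above is pointing at: nltk.classify.accuracy hands each feature set to scikit-learn's DictVectorizer, which calls .items() on every sample, so each sample must be a dict as returned by find_features. In the attempted code, testing_set still held raw (words_list, category) tuples, i.e. lists instead of dicts. A quick sanity check (hypothetical snippet):

# Each sample passed to the classifier must be a dict of feature -> bool;
# a raw word list here is what raises "'list' object has no attribute 'items'".
fs, label = testing_set[0]
print(type(fs))   # should be <class 'dict'>, not <class 'list'>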

So the final code becomes:

documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

np.random.shuffle(documents)
training_set = documents[:1800]
testing_set = documents[1800:]

all_words = []
for words_list, categ in documents:
    for w in words_list:
        all_words.append(w.lower())

all_words = nltk.FreqDist(all_words)
word_features = list(all_words.keys())[:3000]

def find_features(training_set):
    words = set(training_set)
    features = {}
    for w in word_features:
        features[w] = (w in words)
    return features

featuresets = [(find_features(rev), category) for (rev, category) in documents]
np.random.shuffle(featuresets)
training_set = featuresets[:1800]
testing_set = featuresets[1800:]

MNB_classifier = SklearnClassifier(MultinomialNB())
MNB_classifier.train(training_set)
print("MNB_classifier accuracy:", (nltk.classify.accuracy(MNB_classifier, testing_set)) * 100)
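Note that this version still builds word_features from every document, which is exactly the leakage the question raised. A minimal sketch of a leak-free ordering (an assumption about the intended fix, not part of the original answer): split the raw documents first, then derive the vocabulary from the training documents only. It reuses documents and find_features from the final code above; train_docs and test_docs are hypothetical names:

np.random.shuffle(documents)

# 1) Split the raw documents BEFORE any feature selection.
train_docs = documents[:1800]
test_docs = documents[1800:]

# 2) Build the vocabulary from the training documents only, so the test
#    set cannot influence which words become features.
all_words = nltk.FreqDist(w.lower() for words_list, categ in train_docs
                          for w in words_list)
word_features = [w for w, count in all_words.most_common(3000)]

# 3) Only now turn both splits into feature sets.
training_set = [(find_features(rev), category) for (rev, category) in train_docs]
testing_set = [(find_features(rev), category) for (rev, category) in test_docs]

MNB_classifier = SklearnClassifier(MultinomialNB())
MNB_classifier.train(training_set)
print("MNB_classifier accuracy:", nltk.classify.accuracy(MNB_classifier, testing_set) * 100)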
