I have been following SentDex's video series on NLTK and Python, and I wrote a script that uses various models (e.g. logistic regression) to determine the sentiment of a review. My concern is that SentDex's approach appears to include the test set when determining which words to use for training, which is obviously not ideal (the train/test split should happen before feature selection).
(Edited following Mohammed Kashif's comments)
Full code:
```python
import nltk
import numpy as np
from nltk.classify.scikitlearn import SklearnClassifier
from nltk.classify import ClassifierI
from nltk.corpus import movie_reviews
from sklearn.naive_bayes import MultinomialNB

documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

all_words = []
for w in movie_reviews.words():
    all_words.append(w.lower())

all_words = nltk.FreqDist(all_words)
word_features = list(all_words.keys())[:3000]

def find_features(documents):
    words = set(documents)
    features = {}
    for w in word_features:
        features[w] = (w in words)
    return features

featuresets = [(find_features(rev), category) for (rev, category) in documents]
np.random.shuffle(featuresets)

training_set = featuresets[:1800]
testing_set = featuresets[1800:]

MNB_classifier = SklearnClassifier(MultinomialNB())
MNB_classifier.train(training_set)
print("MNB_classifier accuracy:", (nltk.classify.accuracy(MNB_classifier, testing_set)) * 100)
```
What I have tried:
```python
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

np.random.shuffle(documents)
training_set = documents[:1800]
testing_set = documents[1800:]

all_words = []
for w in documents.words():
    all_words.append(w.lower())

all_words = nltk.FreqDist(all_words)
word_features = list(all_words.keys())[:3000]

def find_features(training_set):
    words = set(training_set)
    features = {}
    for w in word_features:
        features[w] = (w in words)
    return features

featuresets = [(find_features(rev), category) for (rev, category) in training_set]
np.random.shuffle(featuresets)

training_set = featuresets
testing_set = testing_set

MNB_classifier = SklearnClassifier(MultinomialNB())
MNB_classifier.train(training_set)
print("MNB_classifier accuracy:", (nltk.classify.accuracy(MNB_classifier, testing_set)) * 100)
```
Which results in the error:
```
Traceback (most recent call last):
  File "", line 34, in <module>
    print("MNB_classifier accuracy:", (nltk.classify.accuracy(MNB_classifier, testing_set)) *100)
  File "C:\ProgramData\Anaconda3\lib\site-packages\nltk\classify\util.py", line 87, in accuracy
    results = classifier.classify_many([fs for (fs, l) in gold])
  File "C:\ProgramData\Anaconda3\lib\site-packages\nltk\classify\scikitlearn.py", line 85, in classify_many
    X = self._vectorizer.transform(featuresets)
  File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\feature_extraction\dict_vectorizer.py", line 291, in transform
    return self._transform(X, fitting=False)
  File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\feature_extraction\dict_vectorizer.py", line 166, in _transform
    for f, v in six.iteritems(x):
  File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\externals\six.py", line 439, in iteritems
    return iter(getattr(d, _iteritems)(**kw))
AttributeError: 'list' object has no attribute 'items'
```
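The traceback makes sense once you look at what testing_set holds at that point: raw (word_list, label) tuples rather than (feature_dict, label) pairs, and scikit-learn's DictVectorizer calls .items() on each feature set it is given. A minimal illustration of the mismatch, with made-up words:

```python
# What testing_set holds in the failing version: a raw word list plus label.
raw_example = (["great", "movie"], "pos")

# What the classifier's vectorizer actually expects: a feature dict plus label.
feature_example = ({"great": True, "bad": False}, "pos")

# DictVectorizer iterates each feature set via .items(); a list has none,
# which is exactly the AttributeError in the traceback.
print(hasattr(raw_example[0], "items"))      # a list: no .items
print(hasattr(feature_example[0], "items"))  # a dict: has .items
```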
Answer:
Okay, there are a few errors in the code. Let's work through them one by one.
First, your documents list is a list of tuples, so it has no words() method. To access all of the words, change the for loop like this:
```python
all_words = []
for words_list, categ in documents:  # <-- each tuple holds a list of words
    for w in words_list:             # <-- then access each word in that list
        all_words.append(w.lower())
```
Second, you need to create feature sets for both the training and test sets; you were building feature sets only from training_set. Change the code to the following:
```python
featuresets = [(find_features(rev), category) for (rev, category) in documents]
np.random.shuffle(featuresets)

training_set = featuresets[:1800]
testing_set = featuresets[1800:]
```
So the final code becomes:
```python
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

np.random.shuffle(documents)
training_set = documents[:1800]
testing_set = documents[1800:]

all_words = []
for words_list, categ in documents:
    for w in words_list:
        all_words.append(w.lower())

all_words = nltk.FreqDist(all_words)
word_features = list(all_words.keys())[:3000]

def find_features(training_set):
    words = set(training_set)
    features = {}
    for w in word_features:
        features[w] = (w in words)
    return features

featuresets = [(find_features(rev), category) for (rev, category) in documents]
np.random.shuffle(featuresets)

training_set = featuresets[:1800]
testing_set = featuresets[1800:]

MNB_classifier = SklearnClassifier(MultinomialNB())
MNB_classifier.train(training_set)
print("MNB_classifier accuracy:", (nltk.classify.accuracy(MNB_classifier, testing_set)) * 100)
```
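Note that this final version still derives word_features from all documents, so the question's original concern about the test set leaking into feature selection remains. A minimal sketch of the leak-free ordering, using a tiny made-up corpus in place of movie_reviews (Counter stands in for nltk.FreqDist here; most_common is used because list(freq_dist.keys())[:3000] is not guaranteed to be frequency-ordered):

```python
import random
from collections import Counter

# Hypothetical stand-in for the movie_reviews documents: (word_list, label) pairs.
documents = [(["great", "fun", "great"], "pos"),
             (["awful", "boring"], "neg"),
             (["fun", "nice"], "pos"),
             (["boring", "bad"], "neg")]

random.shuffle(documents)

# 1. Split FIRST, on the raw documents.
train_docs = documents[:3]
test_docs = documents[3:]

# 2. Choose word_features from the TRAINING documents only,
#    taking the most frequent words (cap of 3000 in the original script).
train_words = Counter(w.lower() for words, _ in train_docs for w in words)
word_features = [w for w, _ in train_words.most_common(3000)]

def find_features(words):
    word_set = set(words)
    return {w: (w in word_set) for w in word_features}

# 3. Featurize both splits with the training-derived vocabulary.
training_set = [(find_features(words), label) for words, label in train_docs]
testing_set = [(find_features(words), label) for words, label in test_docs]
```

Words that appear only in the test documents simply never become features, which is exactly what would happen with genuinely unseen data.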