Python – NLTK train/test split

I've been following SentDex's video series on NLTK and Python and have written a script that uses various models (e.g. logistic regression) to classify the sentiment of reviews. My concern is that SentDex's approach includes the test set when deciding which words to use for training, which is obviously not ideal (feature selection should happen after the train/test split).

(Edited following Mohammed Kashif's comment)

Full code:

import nltk
import numpy as np
from nltk.classify.scikitlearn import SklearnClassifier
from nltk.classify import ClassifierI
from nltk.corpus import movie_reviews
from sklearn.naive_bayes import MultinomialNB

documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

all_words = []
for w in movie_reviews.words():
    all_words.append(w.lower())

all_words = nltk.FreqDist(all_words)
word_features = list(all_words.keys())[:3000]

def find_features(documents):
    words = set(documents)
    features = {}
    for w in word_features:
        features[w] = (w in words)
    return features

featuresets = [(find_features(rev), category) for (rev, category) in documents]
np.random.shuffle(featuresets)
training_set = featuresets[:1800]
testing_set = featuresets[1800:]

MNB_classifier = SklearnClassifier(MultinomialNB())
MNB_classifier.train(training_set)
print("MNB_classifier accuracy:", (nltk.classify.accuracy(MNB_classifier, testing_set)) * 100)

What I've tried:

documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

np.random.shuffle(documents)
training_set = documents[:1800]
testing_set = documents[1800:]

all_words = []
for w in documents.words():
    all_words.append(w.lower())

all_words = nltk.FreqDist(all_words)
word_features = list(all_words.keys())[:3000]

def find_features(training_set):
    words = set(training_set)
    features = {}
    for w in word_features:
        features[w] = (w in words)
    return features

featuresets = [(find_features(rev), category) for (rev, category) in training_set]
np.random.shuffle(featuresets)
training_set = featuresets
testing_set = testing_set

MNB_classifier = SklearnClassifier(MultinomialNB())
MNB_classifier.train(training_set)
print("MNB_classifier accuracy:", (nltk.classify.accuracy(MNB_classifier, testing_set)) * 100)

which produces this error:

Traceback (most recent call last):
  File "", line 34, in <module>
    print("MNB_classifier accuracy:", (nltk.classify.accuracy(MNB_classifier, testing_set)) * 100)
  File "C:\ProgramData\Anaconda3\lib\site-packages\nltk\classify\util.py", line 87, in accuracy
    results = classifier.classify_many([fs for (fs, l) in gold])
  File "C:\ProgramData\Anaconda3\lib\site-packages\nltk\classify\scikitlearn.py", line 85, in classify_many
    X = self._vectorizer.transform(featuresets)
  File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\feature_extraction\dict_vectorizer.py", line 291, in transform
    return self._transform(X, fitting=False)
  File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\feature_extraction\dict_vectorizer.py", line 166, in _transform
    for f, v in six.iteritems(x):
  File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\externals\six.py", line 439, in iteritems
    return iter(getattr(d, _iteritems)(**kw))
AttributeError: 'list' object has no attribute 'items'


Answer:

OK, there are a few errors in the code. Let's go through them one by one.

First, your documents list is a list of tuples, so it has no words() method. To access all the words, change the for loop like this:

all_words = []
for words_list, categ in documents:   # <-- each element is a (word_list, category) tuple
    for w in words_list:              # <-- then access each word in the list
        all_words.append(w.lower())
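As an aside (not part of the original answer), the same count can be written as a single generator expression, since FreqDist accepts any iterable of tokens; this is just an equivalent, slightly more idiomatic sketch of the loop above:

# Equivalent one-liner: FreqDist counts tokens from any iterable.
all_words = nltk.FreqDist(w.lower() for words_list, categ in documents for w in words_list)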

Second, you need to create feature sets for both the training and the test set. You created feature sets only for the training set. Change the code to the following:

featuresets = [(find_features(rev), category) for (rev, category) in documents]

np.random.shuffle(featuresets)
training_set = featuresets[:1800]
testing_set = featuresets[1800:]
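Alternatively (my own suggestion, not part of the original answer), scikit-learn's train_test_split helper does the shuffle and split in one call; test_size=200 below is chosen to match the 1800/200 split above, and random_state just makes the split reproducible:

from sklearn.model_selection import train_test_split

# Shuffle and split the featurized examples in one step.
training_set, testing_set = train_test_split(featuresets, test_size=200, random_state=42)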

So the final code becomes:

documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

np.random.shuffle(documents)
training_set = documents[:1800]
testing_set = documents[1800:]

all_words = []
for words_list, categ in documents:
    for w in words_list:
        all_words.append(w.lower())

all_words = nltk.FreqDist(all_words)
word_features = list(all_words.keys())[:3000]

def find_features(training_set):
    words = set(training_set)
    features = {}
    for w in word_features:
        features[w] = (w in words)
    return features

featuresets = [(find_features(rev), category) for (rev, category) in documents]
np.random.shuffle(featuresets)
training_set = featuresets[:1800]
testing_set = featuresets[1800:]

MNB_classifier = SklearnClassifier(MultinomialNB())
MNB_classifier.train(training_set)
print("MNB_classifier accuracy:", (nltk.classify.accuracy(MNB_classifier, testing_set)) * 100)
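Note that this final version still builds word_features from all documents, test reviews included, which is exactly the leakage the question worries about. A minimal sketch of a leak-free variant (my own suggestion, not part of the original answer; the names train_docs, test_docs and train_words are illustrative) derives the vocabulary from the training documents only. It also uses most_common(3000) rather than list(all_words.keys())[:3000], since FreqDist.keys() is not ordered by frequency:

# Split the raw documents first, before any feature selection.
np.random.shuffle(documents)
train_docs = documents[:1800]
test_docs = documents[1800:]

# Build the vocabulary from the training documents only, so the
# test set plays no part in choosing the features.
train_words = nltk.FreqDist(w.lower() for words_list, categ in train_docs for w in words_list)
word_features = [w for w, count in train_words.most_common(3000)]

# Featurize each split with the training-derived vocabulary.
training_set = [(find_features(rev), category) for (rev, category) in train_docs]
testing_set = [(find_features(rev), category) for (rev, category) in test_docs]

MNB_classifier = SklearnClassifier(MultinomialNB())
MNB_classifier.train(training_set)
print("MNB_classifier accuracy:", nltk.classify.accuracy(MNB_classifier, testing_set) * 100)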
