I am using NLTK's Naive Bayes classifier for a classification task. I loaded a TSV file containing records and labels,
but training fails with an error. Here is my Python code:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

dataset = pd.read_csv('tweets.txt', delimiter='\t', quoting=3)
dataset.isnull().any()
dataset = dataset.fillna(method='ffill')

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

corpus = []
for i in range(0, 16004):
    tweet = re.sub('[^a-zA-Z]', ' ', dataset['tweet'][i])
    tweet = tweet.lower()
    tweet = tweet.split()
    ps = PorterStemmer()
    tweet = [ps.stem(word) for word in tweet if not word in set(stopwords.words('english'))]
    tweet = ' '.join(tweet)
    corpus.append(tweet)

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features=10000)
X = cv.fit_transform(corpus).toarray()
y = dataset.iloc[:, 1].values

from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)

train_set, test_set = X_train[500:], y_train[:500]
classifier = nltk.NaiveBayesClassifier.train(train_set)
The error message is as follows:
File "C:\Users\HSR\Anaconda2\lib\site-packages\nltk\classify\naivebayes.py", line 194, in trainfor featureset, label in labeled_featuresets:ValueError: too many values to unpack
Answer:
NLTK's NaiveBayesClassifier works differently from scikit-learn's estimators. It expects X and y packed together into a single structure, which is then passed to its train() function. In your code, however, you pass only X_train, so NLTK tries to unpack y out of it, and that is what raises the error.
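To see why this produces that particular ValueError, here is a minimal sketch (the shapes are illustrative, not your real 16004 x 10000 matrix): iterating a 2-D NumPy array yields whole rows, and unpacking a row with more than two values into (featureset, label) fails exactly as in the traceback.

import numpy as np

X_train = np.zeros((3, 5))  # illustrative shape only
try:
    # NLTK's train() performs this unpacking internally
    for featureset, label in X_train:
        pass
except ValueError as e:
    print(e)  # too many values to unpack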
The Naive Bayes classifier expects its input to be a list of tuples: the list holds the training samples, and each tuple contains a feature dictionary and the label. Something like the following format:
X = [({feature1: 'val11', feature2: 'val12', ...}, class1),
     ({feature1: 'val21', feature2: 'val22', ...}, class2),
     ...]
You need to convert your input into this format.
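For concreteness, a tiny hand-written train_set in valid Python might look like this (the words and labels here are hypothetical):

train_set = [
    ({'love': 1, 'hate': 0}, 'positive'),
    ({'love': 0, 'hate': 2}, 'negative'),
]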
feature_names = cv.get_feature_names()
train_set = []
for i, single_sample in enumerate(X):
    # Map each column's feature name to its value for this sample
    single_feature_dict = {}
    for j, single_feature in enumerate(single_sample):
        single_feature_dict[feature_names[j]] = single_feature
    # Pair the feature dict with the sample's label
    train_set.append((single_feature_dict, y[i]))
Note: the for loop above could be simplified with a dictionary comprehension, but I'm not very fluent with those.
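For reference, a comprehension-based version might look like this (a sketch with the same behavior, assuming X, y, and feature_names as defined above):

train_set = [
    (dict(zip(feature_names, single_sample)), label)
    for single_sample, label in zip(X, y)
]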
Then you can do:
nltk.NaiveBayesClassifier.train(train_set)
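Once trained, a new sample can be classified by building the same kind of feature dict. A minimal sketch, assuming cv and feature_names from above ('some tweet text' is just a placeholder string):

classifier = nltk.NaiveBayesClassifier.train(train_set)
sample = cv.transform(['some tweet text']).toarray()[0]
sample_features = dict(zip(feature_names, sample))
print(classifier.classify(sample_features))
classifier.show_most_informative_features(10)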