I'm experimenting with the different classifiers and vectorizers that scikit-learn provides. Suppose I have the following:

    training = [["this was a good movie, 'POS'"],
                ["this was a bad movie, 'NEG'"],
                ["i went to the movies, 'NEU'"],
                ["this movie was very exiting it was great, 'POS'"],
                ["this is a boring film, 'NEG'"],
                ........................,
                [" N-sentence, 'LABEL'"]]
    #Where each element of the list is another list that have documents, then.
    splitted = [#remove the tags from training]

    from sklearn.feature_extraction.text import HashingVectorizer
    X = HashingVectorizer(tokenizer=lambda doc: doc, lowercase=False).fit_transform(splitted)
    print X.toarray()

Then I get this vector representation:

    [[ 0.  0.  0. ...,  0.  0.  0.]
     [ 0.  0.  0. ...,  0.  0.  0.]
     [ 0.  0.  0. ...,  0.  0.  0.]
     [ 0.  0.  0. ...,  0.  0.  0.]
     [ 0.  0.  0. ...,  0.  0.  0.]]

The problem is that I don't know whether I vectorized the corpus correctly. Then:

    #This is the test corpus:
    test = ["I don't like this movie it sucks it doesn't liked me"]

    #I vectorize the corpus with hashing vectorizer
    Y = HashingVectorizer(tokenizer=lambda doc: doc, lowercase=False).fit_transform(test)

Then I print Y:

    [[ 0.  0.  0. ...,  0.  0.  0.]]

Next:

    y = [x[-1] for x in training]

    #import SVM and classify
    from sklearn.svm import SVC
    svm = SVC()
    svm.fit(X, y)
    result = svm.predict(X)
    print "\nThe opinion is:\n", result

This is where the problem shows up. Instead of the correct prediction [NEG], I get the following:

    ["this was a good movie, 'POS'"]

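My guess is that I am not vectorizing the training list correctly, or that the target y is set up wrong. Can someone help me understand what is going on, and how I should vectorize training to get a correct prediction?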
Answer:
I'll leave the task of getting the training data into this format to you:

    training = ["this was a good movie",
                "this was a bad movie",
                "i went to the movies",
                "this movie was very exiting it was great",
                "this is a boring film"]
    labels = ['POS', 'NEG', 'NEU', 'POS', 'NEG']

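For what it's worth, here is one way that formatting step could look. It is only a sketch and assumes every item is a one-element list whose string ends in a comma followed by the label in single quotes, as in the question; raw, PATTERN and split_example are names I made up for illustration. It also shows why y = [x[-1] for x in training] did not work: each inner list has a single element, so x[-1] is the whole sentence including the tag, and that string is what the classifier was trained to output.

```python
import re

# Data in the shape used in the question: one string per inner list,
# with the label embedded at the end of the string.
raw = [["this was a good movie, 'POS'"],
       ["this was a bad movie, 'NEG'"],
       ["i went to the movies, 'NEU'"],
       ["this movie was very exiting it was great, 'POS'"],
       ["this is a boring film, 'NEG'"]]

# Assumed layout: <text>, '<LABEL>'  (label in upper-case letters).
PATTERN = re.compile(r"^(?P<text>.*),\s*'(?P<label>[A-Z]+)'\s*$")


def split_example(item):
    """Split one ["text, 'LABEL'"] item into (text, label)."""
    match = PATTERN.match(item[0])
    if match is None:
        raise ValueError("unexpected item format: %r" % (item,))
    return match.group("text").strip(), match.group("label")


# The bug from the question: the "label" is really the full string.
wrong_y = [x[-1] for x in raw]
print(wrong_y[0])   # this was a good movie, 'POS'

pairs = [split_example(item) for item in raw]
training = [text for text, _ in pairs]
labels = [label for _, label in pairs]
print(training[0])  # this was a good movie
print(labels)       # ['POS', 'NEG', 'NEU', 'POS', 'NEG']
```

If your real data has commas inside the label or uses a different quoting style, the pattern needs adjusting.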
Feature extraction

    >>> from sklearn.feature_extraction.text import HashingVectorizer
    >>> # Note: non_negative was removed in scikit-learn 0.21; on newer
    >>> # versions use alternate_sign=False instead (values may differ).
    >>> vect = HashingVectorizer(n_features=5, stop_words='english', non_negative=True)
    >>> X_train = vect.fit_transform(training)
    >>> X_train.toarray()
    [[ 0.          0.70710678  0.          0.          0.70710678]
     [ 0.70710678  0.70710678  0.          0.          0.        ]
     [ 0.          0.          0.          0.          0.        ]
     [ 0.          0.89442719  0.          0.4472136   0.        ]
     [ 1.          0.          0.          0.          0.        ]]

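For a larger corpus you should increase n_features to avoid collisions; I used 5 only so that the resulting matrix is small enough to show here. Also note that I used stop_words='english'. With so few examples I think removing stop words matters, otherwise they may confuse the classifier.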
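To make the collision point concrete, here is a toy version of the hashing trick in plain Python. It is not what HashingVectorizer does internally (scikit-learn uses a different hash function, and the bucket and count_collisions helpers are hypothetical names); it only illustrates that squeezing a vocabulary into very few columns forces unrelated words to share a column.

```python
import hashlib


def bucket(token, n_features):
    """Map a token to a column index using a stable hash (MD5, an arbitrary choice)."""
    digest = hashlib.md5(token.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_features


def count_collisions(tokens, n_features):
    """Count distinct tokens that landed in a column already taken by another token."""
    distinct = set(tokens)
    used_columns = {bucket(tok, n_features) for tok in distinct}
    return len(distinct) - len(used_columns)


training = ["this was a good movie",
            "this was a bad movie",
            "i went to the movies",
            "this movie was very exiting it was great",
            "this is a boring film"]
vocabulary = [word for sentence in training for word in sentence.split()]

# 18 distinct words cannot fit into 5 columns without sharing;
# with about a million columns, sharing becomes very unlikely.
for n_features in (5, 2 ** 20):
    print(n_features, count_collisions(vocabulary, n_features))
```

With 5 columns at least 13 of the 18 distinct words must share a column, which is why such a small n_features is only suitable for display purposes.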
Model training

    from sklearn.svm import SVC
    model = SVC()
    model.fit(X_train, labels)

Prediction

    >>> test = ["I don't like this movie it sucks it doesn't liked me"]
    >>> X_pred = vect.transform(test)
    >>> model.predict(X_pred)
    ['NEG']
    >>> test = ["I think it was a good movie"]
    >>> X_pred = vect.transform(test)
    >>> model.predict(X_pred)
    ['POS']

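Edit: Note that the correct classification of the first test example is just a lucky coincidence, since I don't see any word in it that could have been learned from the training set as negative. In the second example, the word good probably triggered the positive classification.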
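You can check that claim by hand with a few lines of plain Python that list which words of a test sentence also occur in the training set, and under which labels. This is a rough sketch: STOP_WORDS is a small hand-picked set rather than scikit-learn's much longer English list, the regex mirrors the shape of the default token pattern (words of two or more characters), and evidence is a hypothetical helper name.

```python
import re
from collections import defaultdict

training = ["this was a good movie",
            "this was a bad movie",
            "i went to the movies",
            "this movie was very exiting it was great",
            "this is a boring film"]
labels = ['POS', 'NEG', 'NEU', 'POS', 'NEG']

# Assumption: a tiny illustrative stop-word set, not scikit-learn's list.
STOP_WORDS = {"this", "was", "a", "i", "to", "the", "it", "is", "very", "me"}
TOKEN = re.compile(r"\b\w\w+\b")  # words of at least two characters


def tokens(text):
    """Lower-case, tokenize and drop stop words."""
    return [tok for tok in TOKEN.findall(text.lower()) if tok not in STOP_WORDS]


# Record the labels each training word was seen with.
seen_with = defaultdict(set)
for text, label in zip(training, labels):
    for tok in tokens(text):
        seen_with[tok].add(label)


def evidence(text):
    """Words of `text` that occur in the training set, with their labels."""
    return {tok: sorted(seen_with[tok]) for tok in tokens(text) if tok in seen_with}


print(evidence("I don't like this movie it sucks it doesn't liked me"))
# {'movie': ['NEG', 'POS']}
print(evidence("I think it was a good movie"))
# {'good': ['POS'], 'movie': ['NEG', 'POS']}
```

The first sentence shares only movie with the training data, and that word appears under both POS and NEG, so nothing in it points to NEG. The second one shares good, which was only seen with POS.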