Labeled text classification problem, wrong predictions?

I am experimenting with the different classifiers and vectorizers that scikit-learn provides. Let's say I have the following:

training = [["this was a good movie, 'POS'"],
            ["this was a bad movie, 'NEG'"],
            ["i went to the movies, 'NEU'"],
            ["this movie was very exiting it was great, 'POS'"],
            ["this is a boring film, 'NEG'"],
            ........................,
            [" N-sentence, 'LABEL'"]]
# Where each element of the list is another list that holds one document, then:

splitted = [#remove the tags from training]

from sklearn.feature_extraction.text import HashingVectorizer
X = HashingVectorizer(
    tokenizer=lambda doc: doc, lowercase=False).fit_transform(splitted)
print X.toarray()

Then I get this vector representation:

[[ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]]

The problem is that I don't know whether I have vectorized the corpus correctly. Then:

#This is the test corpus:
test = ["I don't like this movie it sucks it doesn't liked me"]

#I vectorize the corpus with the hashing vectorizer
Y = HashingVectorizer(
    tokenizer=lambda doc: doc, lowercase=False).fit_transform(test)

Then I print Y:

[[ 0.  0.  0. ...,  0.  0.  0.]]

Next:

y = [x[-1] for x in training]

#import SVM and classify
from sklearn.svm import SVC
svm = SVC()
svm.fit(X, y)
result = svm.predict(X)
print "\nThe opinion is:\n", result

This is where things go wrong: instead of the actually correct prediction, [NEG], I get the following:

["this was a good movie, 'POS'"]

My guess is that I am not vectorizing training correctly, or that the y target is set up wrong. Can someone help me understand what is going on and how I should vectorize training to get correct predictions?


Answer:

I leave formatting the training data to you:

training = ["this was a good movie",
            "this was a bad movie",
            "i went to the movies",
            "this movie was very exiting it was great",
            "this is a boring film"]

labels = ['POS', 'NEG', 'NEU', 'POS', 'NEG']
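If you want to derive training and labels programmatically from the original nested list, here is a minimal sketch (assuming, as in the question, that every inner list holds exactly one string ending in ", 'LABEL'"):

# Minimal sketch: split the question's combined "text, 'LABEL'" strings.
# Assumes every inner list holds exactly one string ending in ", 'LABEL'".
raw = [["this was a good movie, 'POS'"],
       ["this was a bad movie, 'NEG'"],
       ["i went to the movies, 'NEU'"],
       ["this movie was very exiting it was great, 'POS'"],
       ["this is a boring film, 'NEG'"]]

training, labels = [], []
for item in raw:
    doc = item[0]
    text, _, tag = doc.rpartition(", ")   # split on the last ", "
    training.append(text)
    labels.append(tag.strip("'"))         # drop the surrounding quotes

print(labels)   # ['POS', 'NEG', 'NEU', 'POS', 'NEG']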

Feature extraction

>>> from sklearn.feature_extraction.text import HashingVectorizer
>>> vect = HashingVectorizer(n_features=5, stop_words='english', non_negative=True)
>>> X_train = vect.fit_transform(training)
>>> X_train.toarray()
[[ 0.          0.70710678  0.          0.          0.70710678]
 [ 0.70710678  0.70710678  0.          0.          0.        ]
 [ 0.          0.          0.          0.          0.        ]
 [ 0.          0.89442719  0.          0.4472136   0.        ]
 [ 1.          0.          0.          0.          0.        ]]

For a larger corpus you should increase n_features to avoid collisions; I used 5 only so that the resulting matrix is small enough to display. Also note that I used stop_words='english': with so few examples I think it is important to remove stop words, otherwise they may confuse the classifier.
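As a rough illustration of what a larger hash space looks like (the 2**18 here is an arbitrary choice, not a value from the original answer):

# Sketch: a bigger hash space makes collisions between different words
# far less likely than with n_features=5; 2**18 is an arbitrary choice.
from sklearn.feature_extraction.text import HashingVectorizer

vect_big = HashingVectorizer(n_features=2**18, stop_words='english')
X_big = vect_big.fit_transform(training)   # reuses `training` from above
print(X_big.shape)                         # (5, 262144)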

Model training

from sklearn.svm import SVC
model = SVC()
model.fit(X_train, labels)

Prediction

>>> test = ["I don't like this movie it sucks it doesn't liked me"]
>>> X_pred = vect.transform(test)
>>> model.predict(X_pred)
['NEG']

>>> test = ["I think it was a good movie"]
>>> X_pred = vect.transform(test)
>>> model.predict(X_pred)
['POS']

Edit: note that the correct classification of the first test example is just a lucky coincidence, since I don't see any word learned from the training set that could be labeled as negative. In the second example, the word good probably triggered the positive classification.
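One way to check this claim yourself is to look at which non-stop-word tokens of the first test sentence were ever seen in the training set; this is only a sketch, and the exact overlap depends on your scikit-learn version's stop-word list:

# Sketch: which learned (non-stop-word) tokens does the first test
# sentence share with the training corpus?
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(stop_words='english').fit(training)
analyze = cv.build_analyzer()
test_tokens = analyze("I don't like this movie it sucks it doesn't liked me")
print(set(test_tokens) & set(cv.vocabulary_))
# likely just {'movie'}, which occurs in both POS and NEG training sentences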
