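Labeled text classification problem: wrong prediction?

I am experimenting with the different classifiers and vectorizers that scikit-learn provides. Suppose I have the following: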

    training = [["this was a good movie, 'POS'"],
                ["this was a bad movie, 'NEG'"],
                ["i went to the movies, 'NEU'"],
                ["this movie was very exiting it was great, 'POS'"],
                ["this is a boring film, 'NEG'"],
                ........................,
                [" N-sentence, 'LABEL'"]]

    # Each element of the list is another list that holds one document. Then:
    splitted = [...]  # training with the tags removed

    from sklearn.feature_extraction.text import HashingVectorizer

    X = HashingVectorizer(
        tokenizer=lambda doc: doc, lowercase=False).fit_transform(splitted)

    print X.toarray()

Then I get this vector representation:

    [[ 0.  0.  0. ...,  0.  0.  0.]
     [ 0.  0.  0. ...,  0.  0.  0.]
     [ 0.  0.  0. ...,  0.  0.  0.]
     [ 0.  0.  0. ...,  0.  0.  0.]
     [ 0.  0.  0. ...,  0.  0.  0.]]

The problem is that I don't know whether I vectorized the corpus correctly. Then:

    # This is the test corpus:
    test = ["I don't like this movie it sucks it doesn't liked me"]

    # I vectorize the corpus with hashing vectorizer
    Y = HashingVectorizer(
        tokenizer=lambda doc: doc, lowercase=False).fit_transform(test)

Then I print Y:

    [[ 0.  0.  0. ...,  0.  0.  0.]]

Next:

    y = [x[-1] for x in training]

    # import SVM and classify
    from sklearn.svm import SVC
    svm = SVC()
    svm.fit(X, y)
    result = svm.predict(X)
    print "\nThe opinion is:\n", result

This is where the problem shows up. Instead of the correct prediction [NEG], I get the following:

    ["this was a good movie, 'POS'"]

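My guess is that I am not vectorizing training correctly, or that the y target is set up wrong. Can anyone help me understand what is happening, and how I should vectorize training to get a correct prediction?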


Answer:

I will leave the task of formatting the training data to you:

    training = ["this was a good movie",
                "this was a bad movie",
                "i went to the movies",
                "this movie was very exiting it was great",
                "this is a boring film"]
    labels = ['POS', 'NEG', 'NEU', 'POS', 'NEG']
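One possible way to do that formatting is sketched below. It assumes every inner list holds exactly one string and that the label is the quoted token after the last comma, as in the question; the names `raw` and `split_example` are hypothetical. It also shows why the question's `y = [x[-1] for x in training]` went wrong: the last element of a one-element list is the whole sentence, so the classifier was trained to predict full sentences rather than labels.

```python
# Raw data in the question's shape: one string per inner list,
# with the quoted label after the last comma.
raw = [["this was a good movie, 'POS'"],
       ["this was a bad movie, 'NEG'"],
       ["i went to the movies, 'NEU'"],
       ["this movie was very exiting it was great, 'POS'"],
       ["this is a boring film, 'NEG'"]]


def split_example(item):
    """Split "text, 'LABEL'" into (text, label) at the last comma."""
    text, _, label = item[0].rpartition(",")
    return text.strip(), label.strip().strip("'")


pairs = [split_example(item) for item in raw]
training = [text for text, _ in pairs]
labels = [label for _, label in pairs]

# x[-1] on a one-element list is the whole string, not the label.
print(raw[0][-1])   # this was a good movie, 'POS'
print(labels)       # ['POS', 'NEG', 'NEU', 'POS', 'NEG']
```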

Feature extraction

    >>> from sklearn.feature_extraction.text import HashingVectorizer
    >>> vect = HashingVectorizer(n_features=5, stop_words='english', non_negative=True)
    >>> X_train = vect.fit_transform(training)
    >>> X_train.toarray()
    [[ 0.          0.70710678  0.          0.          0.70710678]
     [ 0.70710678  0.70710678  0.          0.          0.        ]
     [ 0.          0.          0.          0.          0.        ]
     [ 0.          0.89442719  0.          0.4472136   0.        ]
     [ 1.          0.          0.          0.          0.        ]]

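For a larger corpus you should increase n_features to avoid collisions; I used 5 only so that the resulting matrix is easy to read. Also note that I used stop_words='english'. With so few examples I think removing stop words matters, because otherwise they may confuse the classifier.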
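To make the collision point concrete, here is a rough sketch of the hashing trick in plain Python. It is not scikit-learn's implementation: HashingVectorizer uses a signed MurmurHash3, while this sketch substitutes MD5 from the standard library purely so the example is deterministic and self-contained. The helper names `bucket` and `count_collisions` are hypothetical.

```python
import hashlib


def bucket(token, n_features):
    """Map a token to a column index with a stable hash (MD5 as a stand-in)."""
    digest = hashlib.md5(token.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_features


def count_collisions(tokens, n_features):
    """Count distinct tokens that land in a column another token already uses."""
    used = set()
    collisions = 0
    for token in sorted(set(tokens)):
        index = bucket(token, n_features)
        if index in used:
            collisions += 1
        used.add(index)
    return collisions


# The nine content words left in the training set after stop-word removal.
tokens = "good movie bad went movies exiting great boring film".split()

for n_features in (5, 2 ** 20):
    print(n_features, count_collisions(tokens, n_features))
```

With nine distinct words and only five columns, at least four words must share a column with another word, so different words become indistinguishable to the classifier. A much larger n_features makes this very unlikely. As a side note, in scikit-learn 0.21 and later the non_negative argument no longer exists; alternate_sign=False is the nearest replacement, and the matrix it produces may differ from the one printed above.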

Model training

    from sklearn.svm import SVC

    model = SVC()
    model.fit(X_train, labels)

Prediction

    >>> test = ["I don't like this movie it sucks it doesn't liked me"]
    >>> X_pred = vect.transform(test)
    >>> model.predict(X_pred)
    ['NEG']
    >>> test = ["I think it was a good movie"]
    >>> X_pred = vect.transform(test)
    >>> model.predict(X_pred)
    ['POS']

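Edit: note that the correct classification of the first test example is just a lucky coincidence, since I don't see any word learned from the training set that could be labeled as negative. In the second example, the word good probably triggered the positive classification.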
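The observation in the edit can be checked with a small sketch that lists which content words of each test sentence also occur in the training set. The stop-word list here is a hypothetical, hand-picked subset for illustration only, not scikit-learn's English list, and the tokenization rule (lowercase, tokens of two or more word characters) is an assumption chosen to resemble the vectorizer's default behavior.

```python
import re

# Hypothetical, deliberately tiny stop-word list for this sketch only.
STOP_WORDS = {"i", "it", "this", "was", "is", "a", "to", "the", "very", "me"}

training = ["this was a good movie",
            "this was a bad movie",
            "i went to the movies",
            "this movie was very exiting it was great",
            "this is a boring film"]


def content_words(text):
    """Lowercase, keep tokens of 2+ word characters, drop stop words."""
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return {token for token in tokens if token not in STOP_WORDS}


# Every content word the classifier could have seen during training.
seen = set().union(*(content_words(doc) for doc in training))

tests = ["I don't like this movie it sucks it doesn't liked me",
         "I think it was a good movie"]

for sentence in tests:
    print(sorted(content_words(sentence) & seen))
# ['movie']
# ['good', 'movie']
```

The first sentence shares only "movie" with the training data, and that word appears under both POS and NEG labels, so it carries no negative signal. The second sentence also shares "good", which occurs only in a POS example.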
