Labeled text classification problem, wrong predictions?

I am experimenting with the different classifiers and vectorizers that scikit-learn provides. Let's say I have the following:

training = [["this was a good movie, 'POS'"],
            ["this was a bad movie, 'NEG'"],
            ["i went to the movies, 'NEU'"],
            ["this movie was very exiting it was great, 'POS'"],
            ["this is a boring film, 'NEG'"],
            ........................,
            [" N-sentence, 'LABEL'"]]
# Where each element of the list is another list that holds one document, then:

splitted = [#remove the tags from training]

from sklearn.feature_extraction.text import HashingVectorizer
X = HashingVectorizer(
    tokenizer=lambda doc: doc, lowercase=False).fit_transform(splitted)
print X.toarray()

Then I get this vector representation:

[[ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]]

The problem is that I don't know whether I have vectorized the corpus correctly. Then:

#This is the test corpus:
test = ["I don't like this movie it sucks it doesn't liked me"]

#I vectorize the corpus with the hashing vectorizer
Y = HashingVectorizer(
    tokenizer=lambda doc: doc, lowercase=False).fit_transform(test)

Then I print Y:

[[ 0.  0.  0. ...,  0.  0.  0.]]

Next:

y = [x[-1] for x in training]

#import SVM and classify
from sklearn.svm import SVC
svm = SVC()
svm.fit(X, y)
result = svm.predict(X)
print "\nThe opinion is:\n", result

This is where things go wrong: instead of the actually correct prediction, [NEG], I get the following:

["this was a good movie, 'POS'"]

My guess is that I am not vectorizing training correctly, or that the y target is set up wrong. Can someone help me understand what is going on and how I should vectorize training to get correct predictions?


Answer:

I leave formatting the training data to you:

training = ["this was a good movie",
            "this was a bad movie",
            "i went to the movies",
            "this movie was very exiting it was great",
            "this is a boring film"]

labels = ['POS', 'NEG', 'NEU', 'POS', 'NEG']
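If you want to derive training and labels programmatically from the original nested list, here is a minimal sketch (assuming, as in the question, that every inner list holds exactly one string ending in ", 'LABEL'"):

# Minimal sketch: split the question's combined "text, 'LABEL'" strings.
# Assumes every inner list holds exactly one string ending in ", 'LABEL'".
raw = [["this was a good movie, 'POS'"],
       ["this was a bad movie, 'NEG'"],
       ["i went to the movies, 'NEU'"],
       ["this movie was very exiting it was great, 'POS'"],
       ["this is a boring film, 'NEG'"]]

training, labels = [], []
for item in raw:
    doc = item[0]
    text, _, tag = doc.rpartition(", ")   # split on the last ", "
    training.append(text)
    labels.append(tag.strip("'"))         # drop the surrounding quotes

print(labels)   # ['POS', 'NEG', 'NEU', 'POS', 'NEG']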

Feature extraction

>>> from sklearn.feature_extraction.text import HashingVectorizer
>>> vect = HashingVectorizer(n_features=5, stop_words='english', non_negative=True)
>>> X_train = vect.fit_transform(training)
>>> X_train.toarray()
[[ 0.          0.70710678  0.          0.          0.70710678]
 [ 0.70710678  0.70710678  0.          0.          0.        ]
 [ 0.          0.          0.          0.          0.        ]
 [ 0.          0.89442719  0.          0.4472136   0.        ]
 [ 1.          0.          0.          0.          0.        ]]

For a larger corpus you should increase n_features to avoid collisions; I used 5 only so that the resulting matrix is small enough to display. Also note that I used stop_words='english': with so few examples I think it is important to remove stop words, otherwise they may confuse the classifier.
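As a rough illustration of what a larger hash space looks like (the 2**18 here is an arbitrary choice, not a value from the original answer):

# Sketch: a bigger hash space makes collisions between different words
# far less likely than with n_features=5; 2**18 is an arbitrary choice.
from sklearn.feature_extraction.text import HashingVectorizer

vect_big = HashingVectorizer(n_features=2**18, stop_words='english')
X_big = vect_big.fit_transform(training)   # reuses `training` from above
print(X_big.shape)                         # (5, 262144)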

Model training

from sklearn.svm import SVC
model = SVC()
model.fit(X_train, labels)

Prediction

>>> test = ["I don't like this movie it sucks it doesn't liked me"]
>>> X_pred = vect.transform(test)
>>> model.predict(X_pred)
['NEG']

>>> test = ["I think it was a good movie"]
>>> X_pred = vect.transform(test)
>>> model.predict(X_pred)
['POS']

Edit: note that the correct classification of the first test example is just a lucky coincidence, since I don't see any word learned from the training set that could be labeled as negative. In the second example, the word good probably triggered the positive classification.
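One way to check this claim yourself is to look at which non-stop-word tokens of the first test sentence were ever seen in the training set; this is only a sketch, and the exact overlap depends on your scikit-learn version's stop-word list:

# Sketch: which learned (non-stop-word) tokens does the first test
# sentence share with the training corpus?
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(stop_words='english').fit(training)
analyze = cv.build_analyzer()
test_tokens = analyze("I don't like this movie it sucks it doesn't liked me")
print(set(test_tokens) & set(cv.vocabulary_))
# likely just {'movie'}, which occurs in both POS and NEG training sentences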
