How can I merge two CountVectorizers so that duplicate features are handled?

Consider this simple example:

import pandas as pd

data = pd.DataFrame({'text1': ['hello world', 'hello universe'],
                     'text2': ['good morning', 'hello two three']})

data
Out[489]:
            text1            text2
0     hello world     good morning
1  hello universe  hello two three

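As you can see, text1 and text2 share one word: hello. I am trying to create n-grams separately for text1 and text2, and I would like to combine the results into a single CountVectorizer object.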

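The idea is to create n-grams separately for these two variables and use them as features in a machine learning algorithm. However, I do not want the extra n-grams that would be created by concatenating the strings, such as world good in hello world good morning. This is why I keep the n-gram creation separate.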

The problem is that doing so produces a duplicate hello column in the resulting (sparse) vectors.

See here:

from sklearn.feature_extraction.text import CountVectorizer

vector = CountVectorizer(ngram_range=(1, 2))

# Note: in scikit-learn >= 1.0, get_feature_names() is replaced by get_feature_names_out()
v1 = vector.fit_transform(data.text1.values)
print(vector.get_feature_names())
['hello', 'hello universe', 'hello world', 'universe', 'world']

v2 = vector.fit_transform(data.text2.values)
print(vector.get_feature_names())
['good', 'good morning', 'hello', 'hello two', 'morning', 'three', 'two', 'two three']

Now, concatenating v1 and v2 gives 13 columns (5 from text1 plus 8 from text2):

from scipy.sparse import hstack

print(hstack((v1, v2)).toarray())
[[1 0 1 0 1 1 1 0 0 1 0 0 0]
 [1 1 0 1 0 0 0 1 1 0 1 1 1]]

The correct set of text features should have 12 entries:

hello, world, hello world, good, morning, good morning, hello universe, universe, two, three, hello two, two three
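The counts above can be checked directly from the two feature lists printed earlier. This is only a sanity check in plain Python; the variable names are illustrative and not part of the original code.

```python
# Feature names printed by the two separate fit_transform() calls above.
v1_names = ['hello', 'hello universe', 'hello world', 'universe', 'world']
v2_names = ['good', 'good morning', 'hello', 'hello two',
            'morning', 'three', 'two', 'two three']

stacked = v1_names + v2_names                        # what hstack effectively yields
unique = sorted(set(stacked))                        # the desired feature set
duplicates = sorted(set(v1_names) & set(v2_names))   # features counted twice

print(len(stacked), len(unique), duplicates)  # 13 12 ['hello']
```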

What can I do to get the correct, unique n-grams as features? Thanks!

Answer:

I think a better way to solve this is to create a custom Transformer that wraps a CountVectorizer.

I would do it like this:

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np


class MultiRowsCountVectorizer(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.vectorizer = CountVectorizer(ngram_range=(1, 2))

    def fit(self, X, y=None):
        # Flatten all text columns into one 1-D array so that
        # every text field is treated as its own document
        X_ = np.reshape(X.values, (-1,))
        self.vectorizer.fit(X_)
        return self

    def transform(self, X, y=None):
        # Join all text columns of each row into one string, separated by spaces
        X_ = X.apply(' '.join, axis=1)
        return self.vectorizer.transform(X_)

    def get_feature_names(self):
        # In scikit-learn >= 1.0, CountVectorizer uses get_feature_names_out() instead
        return self.vectorizer.get_feature_names()


transformer = MultiRowsCountVectorizer()
X_ = transformer.fit_transform(data)
transformer.get_feature_names()

The fit() method fits the CountVectorizer by treating the columns independently, while transform() treats the columns of a row as a single line of text.

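np.reshape(X.values, (-1,)) converts a matrix of shape (N, n_columns) into a one-dimensional array of size (N*n_columns,). This ensures that each text field is treated independently during fit(). Afterwards, transform() is applied to all text features of a sample by joining them together. Because the vocabulary is fixed during fit(), n-grams that only appear across the field boundary (such as world good) are not in the vocabulary and are ignored during transform().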
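The two steps can also be traced by hand, outside the class. The following is a minimal sketch using the data frame from the question; it reads the feature names from the fitted vocabulary_ attribute so that it does not depend on which get_feature_names variant the installed scikit-learn version provides. The variable names (flat, vocab, joined, counts) are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

data = pd.DataFrame({'text1': ['hello world', 'hello universe'],
                     'text2': ['good morning', 'hello two three']})

# Step 1 (fit): flatten the (2, 2) frame into 4 independent documents.
flat = np.reshape(data.values, (-1,))

vec = CountVectorizer(ngram_range=(1, 2))
vec.fit(flat)

# vocabulary_ maps each n-gram to its column index; sorting the keys
# gives the feature names in column order.
vocab = sorted(vec.vocabulary_)

# Step 2 (transform): join the columns of each row with a space and
# count n-grams against the vocabulary learned in step 1.
joined = data.apply(' '.join, axis=1)
counts = vec.transform(joined).toarray()

print(len(vocab))             # 12
print('world good' in vocab)  # False: it only exists across the field boundary
print(counts)
```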

This custom Transformer returns the desired 12 features:

['good', 'good morning', 'hello', 'hello two', 'hello universe', 'hello world', 'morning', 'three', 'two', 'two three', 'universe', 'world']

and produces the following feature matrix:

[[1 1 1 0 0 1 1 0 0 0 0 1]
 [0 0 2 1 1 0 0 1 1 1 1 0]]

Note: this custom Transformer assumes that X is a pd.DataFrame with n text columns.

Edit: the text fields need to be joined with a space during transform().
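The reason for the space can be seen with one row from the question: without a separator, the last word of the first field fuses with the first word of the second. A small illustration in plain Python (the row list is just the first sample written out by hand):

```python
row = ['hello world', 'good morning']

no_space = ''.join(row)     # words at the field boundary are merged
with_space = ' '.join(row)  # every word stays a separate token

print(no_space)    # hello worldgood morning
print(with_space)  # hello world good morning
```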
