In text analysis, what exactly happens when I call fit()? And what does transform() do to text data?

I understand how this works for numeric data, but I find it hard to picture the process for text data.

I have an array of text:

    sents_processed[0:5]

    ['so there is no way for me plug in here in us unless go by converter',
     'good case excellent value',
     'great for jawbone',
     'tied charger for conversations lasting more than minutes major problems',
     'mic is great']

To vectorize it, I use the CountVectorizer class:

    from sklearn.feature_extraction.text import CountVectorizer

    vectorizer = CountVectorizer(analyzer='word', tokenizer=None, preprocessor=None,
                                 stop_words=None, max_features=4500)
    data_features = vectorizer.fit_transform(sents_processed)
    print(data_features.toarray())

    [[0 0 0 ... 0 0 0]
     [0 0 0 ... 0 0 0]
     [0 0 0 ... 0 0 0]
     ...
     [0 0 0 ... 0 0 0]
     [0 0 0 ... 0 0 0]
     [0 0 0 ... 0 0 0]]

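I know this gives me vectors of up to 4500 features (the max_features cap). However, I cannot picture what the fit method actually does behind the scenes, or how the data is converted by the transform function, especially since the input is text.

Answer:

Let's look at a simple example: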

    from sklearn.feature_extraction.text import CountVectorizer

    text = ['this is a sentence', 'this is another sentence', 'not a sentence']

Here I have three sentences.

    vector = CountVectorizer(analyzer='word', tokenizer=None, max_features=4500)
    dt = vector.fit_transform(text)

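The first step of this process (the fit part) is to build a vocabulary: every distinct word across all the sentences is assigned an integer index, in alphabetical order. Note that 'a' is missing from the vocabulary because the default token pattern only keeps tokens of two or more characters.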

    print(vector.vocabulary_)
    # {'this': 4, 'is': 1, 'sentence': 3, 'another': 0, 'not': 2}
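The vocabulary-building step can be imitated in a few lines of plain Python. This is only a rough sketch of the idea, not scikit-learn's actual implementation: the helper names `tokenize` and `build_vocabulary` are made up for illustration, and options such as max_features, stop words, and n-grams are ignored. The regular expression is CountVectorizer's default token pattern.

```python
import re

text = ['this is a sentence', 'this is another sentence', 'not a sentence']

# Default CountVectorizer token pattern: runs of two or more word characters,
# which is why the single-letter word 'a' never becomes a feature.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(doc):
    """Lowercase the document and split it into tokens (hypothetical helper)."""
    return TOKEN_PATTERN.findall(doc.lower())


def build_vocabulary(docs):
    """Map every distinct token to a column index, in alphabetical order."""
    terms = sorted({token for doc in docs for token in tokenize(doc)})
    return {term: index for index, term in enumerate(terms)}


vocabulary = build_vocabulary(text)
print(vocabulary)
# {'another': 0, 'is': 1, 'not': 2, 'sentence': 3, 'this': 4}
```

The mapping is the same as vector.vocabulary_ above; only the order in which the dictionary prints differs.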

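From here on it works with the words' indices rather than the words themselves. The method vector.fit_transform() converts the sentences into numbers according to the indices in the vocabulary: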

    data_features = vector.fit_transform(text)
    print(data_features.toarray())
    # [[0 1 0 1 1]
    #  [1 1 0 1 1]
    #  [0 0 1 1 0]]

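This array is just another representation of the sentences, one row per sentence. With a five-word vocabulary, representing a sentence as an array starts from an array of five zeros (the vocabulary size), which stands for an empty sentence:

    [0, 0, 0, 0, 0]

Now take each sentence in turn and fill in the positions of the words it contains, using the indices from the vocabulary. That gives the rows of the array: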

    [0            1(is)   0        1(sentence)   1(this)]
    [1(another)   1(is)   0        1(sentence)   1(this)]
    [0            0       1(not)   1(sentence)   0      ]

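Each entry is the number of times that word occurs in the sentence. In this example no word is repeated within a sentence, so every entry is 1 if the word is present and 0 otherwise.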
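The counting step can be written out by hand as well. The sketch below is an illustration under the same simplifying assumptions (a hypothetical `transform` helper, dense Python lists instead of the sparse matrix scikit-learn returns). The vocabulary is hard-coded from the example above, and the extra sentence is invented to show a count larger than 1 and a word that is not in the vocabulary.

```python
import re
from collections import Counter

# Vocabulary learned in the example above (word -> column index).
vocabulary = {'another': 0, 'is': 1, 'not': 2, 'sentence': 3, 'this': 4}

# Default CountVectorizer token pattern: tokens of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def transform(docs, vocabulary):
    """Turn each document into a row of word counts (hypothetical helper)."""
    rows = []
    for doc in docs:
        counts = Counter(TOKEN_PATTERN.findall(doc.lower()))
        row = [0] * len(vocabulary)           # start from the "empty sentence"
        for term, n in counts.items():
            index = vocabulary.get(term)
            if index is not None:             # words outside the vocabulary are dropped
                row[index] = n
        rows.append(row)
    return rows


text = ['this is a sentence', 'this is another sentence', 'not a sentence']
matrix = transform(text, vocabulary)
print(matrix)
# [[0, 1, 0, 1, 1], [1, 1, 0, 1, 1], [0, 0, 1, 1, 0]]

# An invented sentence with repeated words and one unknown word ('really').
extra = transform(['this sentence is not this sentence, really'], vocabulary)
print(extra)
# [[0, 1, 1, 2, 2]]
```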

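If you look at it carefully you can see how it is generated. For more background, read up on the bag-of-words model, and on word embeddings as a further step.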
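To see the two steps separately with scikit-learn itself, fit() can be called on its own and transform() applied afterwards, including to sentences that were not seen during fitting. The new sentences below are invented for illustration, and default CountVectorizer settings are assumed.

```python
from sklearn.feature_extraction.text import CountVectorizer

text = ['this is a sentence', 'this is another sentence', 'not a sentence']

vector = CountVectorizer()
vector.fit(text)  # only learns the vocabulary; nothing is converted yet

# transform() reuses the learned vocabulary; it never adds new words to it.
new_docs = ['this is not another sentence', 'completely unseen words']
counts = vector.transform(new_docs).toarray()
print(counts)
# [[1 1 1 1 1]
#  [0 0 0 0 0]]

# fit() followed by transform() gives the same matrix as fit_transform().
same = (vector.transform(text).toarray()
        == CountVectorizer().fit_transform(text).toarray()).all()
print(same)
```

The second new sentence maps to all zeros because none of its words are in the vocabulary learned by fit().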
