Implementing a TF-IDF vectorizer from scratch

I am trying to implement a tf-idf vectorizer from scratch in Python. I computed my IDF values, but they do not match the IDF values computed by sklearn's TfidfVectorizer().

What am I doing wrong?

    corpus = [
        'this is the first document',
        'this document is the second document',
        'and this is the third one',
        'is this the first document',
    ]

    from collections import Counter
    from tqdm import tqdm
    from scipy.sparse import csr_matrix
    import math
    import operator
    from sklearn.preprocessing import normalize
    import numpy

    sentence = []
    for i in range(len(corpus)):
        sentence.append(corpus[i].split())

    word_freq = {}  # document frequency of each word
    for i in range(len(sentence)):
        tokens = sentence[i]
        for w in tokens:
            try:
                word_freq[w].add(i)  # word already seen: record this document's index
            except KeyError:
                word_freq[w] = {i}  # first occurrence: start a new set of document indices

    for i in word_freq:
        word_freq[i] = len(word_freq[i])  # number of documents that contain the word

    def idf():
        idfDict = {}
        for word in word_freq:
            idfDict[word] = math.log(len(sentence) / word_freq[word])
        return idfDict

    idfDict = idf()

Expected output (obtained from vectorizer.idf_):

[1.91629073 1.22314355 1.51082562 1.         1.91629073 1.91629073 1.22314355 1.91629073 1.        ]

Actual output (the values are the idf of the corresponding keys):

    {'and': 1.3862943611198906,
     'document': 0.28768207245178085,
     'first': 0.6931471805599453,
     'is': 0.0,
     'one': 1.3862943611198906,
     'second': 1.3862943611198906,
     'the': 0.0,
     'third': 1.3862943611198906,
     'this': 0.0}

Answer:

Several default parameters can affect sklearn's calculation, but the one that matters most here is:

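    smooth_idf : boolean (default=True)
        Smooth idf weights by adding one to document frequencies, as if an extra document was seen containing every term in the collection exactly once. Prevents zero divisions.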

If you subtract 1 from each element and raise e to that power, you get values that are very close to 5 / n for small values of n:

    1.91629073 => 5/2
    1.22314355 => 5/4
    1.51082562 => 5/3
    1          => 5/5
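The observation above can be checked numerically. The following is a minimal sketch using only the four distinct values quoted from vectorizer.idf_; the variable names are illustrative and not part of sklearn.

```python
import math

# Distinct IDF values quoted in the question's expected output.
sklearn_idf = [1.91629073, 1.22314355, 1.51082562, 1.0]

# Undo the "+ 1" and the natural log, then solve exp(idf - 1) = 5 / n for n.
ratios = [math.exp(value - 1) for value in sklearn_idf]
recovered_n = [round(5 / ratio) for ratio in ratios]

for value, ratio, n in zip(sklearn_idf, ratios, recovered_n):
    print(f"{value:.8f} -> exp(idf - 1) = {ratio:.4f}, about 5/{n}")
```

The numerator 5 is the corpus size (4 documents) plus one, and each n is a document frequency plus one, which is what add-one smoothing would produce.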

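In any case, there is no single tf-idf implementation; the metric you define is just a heuristic that tries to capture certain properties (such as "a higher idf should correlate with rarity in the corpus"), so I wouldn't worry too much about reproducing another implementation exactly.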

sklearn appears to use log((number of documents + 1) / (document frequency of word + 1)) + 1, which is as if there were one extra document containing every word in the corpus.
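To make the smoothed formula concrete, here is a minimal from-scratch sketch applied to the corpus from the question. It assumes whitespace tokenization and the natural logarithm; smoothed_idf is a hypothetical helper name, not a sklearn function.

```python
import math

corpus = [
    'this is the first document',
    'this document is the second document',
    'and this is the third one',
    'is this the first document',
]

def smoothed_idf(documents):
    """Add-one smoothed IDF: ln((1 + N) / (1 + df)) + 1 (assumed formula)."""
    n_docs = len(documents)
    doc_freq = {}
    for doc in documents:
        for word in set(doc.split()):  # count each word once per document
            doc_freq[word] = doc_freq.get(word, 0) + 1
    # Sort alphabetically so the order mirrors a sorted vocabulary.
    return {
        word: math.log((1 + n_docs) / (1 + df)) + 1
        for word, df in sorted(doc_freq.items())
    }

idf_values = smoothed_idf(corpus)
for word, value in idf_values.items():
    print(f"{word}: {value:.8f}")
```

Under these assumptions, 'and' (one document) comes out near 1.91629073 and 'document' (three documents) near 1.22314355, while a word present in all four documents gets exactly 1 instead of the 0 produced by the unsmoothed log(N / df).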

Edit: this last paragraph is confirmed by the docstring of TfIdfNormalizer.
