Why is the output for "good movie" 0.707107? I expected it to be 1/1 * ln(5/2) = 0.91629.
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

texts = [
    "good movie",
    "not a good movie",
    "did not like",
    "i like it",
    "good one",
]

# using default tokenizer in TfidfVectorizer
tfidf = TfidfVectorizer(min_df=2, max_df=0.5, ngram_range=(1, 2))
features = tfidf.fit_transform(texts)

# get_feature_names_out() replaces get_feature_names(),
# which was removed in scikit-learn 1.2
pd.DataFrame(
    features.todense(),
    columns=tfidf.get_feature_names_out()
)
Answer:
This is because of the norm and smooth_idf parameters. Both are enabled by default (norm='l2' and smooth_idf=True). With both turned off:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

texts = [
    "good movie",
    "not a good movie",
    "did not like",
    "i like it",
    "good one",
]

# using default tokenizer in TfidfVectorizer,
# with L2 normalization and idf smoothing disabled
tfidf = TfidfVectorizer(min_df=2, max_df=0.5, norm=None,
                        smooth_idf=False, ngram_range=(1, 2))
features = tfidf.fit_transform(texts)

# get_feature_names_out() replaces get_feature_names(),
# which was removed in scikit-learn 1.2
pd.DataFrame(
    features.todense(),
    columns=tfidf.get_feature_names_out()
)
Output:
   good movie      like     movie       not
0    1.916291  0.000000  1.916291  0.000000
1    1.916291  0.000000  1.916291  1.916291
2    0.000000  1.916291  0.000000  1.916291
3    0.000000  1.916291  0.000000  0.000000
4    0.000000  0.000000  0.000000  0.000000
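With smooth_idf=False, the formula sklearn uses to compute idf is log [ n / df(t) ] + 1, where log is the natural logarithm, n is the number of documents, and df(t) is the number of documents containing term t. So your result of 0.91629 is right once you add 1.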
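As a sanity check, the unsmoothed value can be reproduced by hand. The sketch below counts how many of the five texts contain the term "movie" and applies the formula above. It uses a plain whitespace split as a stand-in for the vectorizer's tokenizer, which is an assumption that happens to be good enough for this term.

```python
import math

texts = ["good movie", "not a good movie", "did not like", "i like it", "good one"]

n = len(texts)  # number of documents: 5

# Document frequency: number of texts that contain the token "movie".
# str.split() is a simplification of TfidfVectorizer's default tokenizer.
df_movie = sum("movie" in text.split() for text in texts)

# Unsmoothed idf: ln(n / df) + 1
idf_unsmoothed = math.log(n / df_movie) + 1
print(df_movie, round(idf_unsmoothed, 6))  # 2 1.916291
```

Each of these terms occurs at most once per text, so the term frequency is 1 and the tf-idf value equals the idf itself.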
With smooth_idf=True (the default), the formula becomes log [ (1 + n) / (1 + df(t)) ] + 1. With smoothing on and normalization still disabled,
tfidf = TfidfVectorizer(min_df=2, max_df=0.5, norm=None, smooth_idf=True, ngram_range=(1, 2))
produces this output:
   good movie      like     movie       not
0    1.693147  0.000000  1.693147  0.000000
1    1.693147  0.000000  1.693147  1.693147
2    0.000000  1.693147  0.000000  1.693147
3    0.000000  1.693147  0.000000  0.000000
4    0.000000  0.000000  0.000000  0.000000
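To see the effect of smoothing side by side, here is a minimal sketch of both idf variants. The helper name `idf` and its parameters are hypothetical and are not part of scikit-learn; it only mirrors the two formulas quoted in this answer.

```python
import math

def idf(n_docs, doc_freq, smooth=True):
    """Hypothetical helper mirroring the two idf formulas from the answer."""
    if smooth:
        # Smoothed: ln((1 + n) / (1 + df)) + 1
        return math.log((1 + n_docs) / (1 + doc_freq)) + 1
    # Unsmoothed: ln(n / df) + 1
    return math.log(n_docs / doc_freq) + 1

# Every kept feature in the example occurs in 2 of the 5 documents.
smoothed = idf(5, 2, smooth=True)
unsmoothed = idf(5, 2, smooth=False)
print(round(smoothed, 6), round(unsmoothed, 6))  # 1.693147 1.916291
```

Smoothing adds one to both the document count and the document frequency, so the ratio here changes from 5/2 to 6/3.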
So where does 0.707107 come from?
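Look at the first row of the smoothed output: the value 1.693147 (call it a) appears twice, so the L2 norm of that row is sqrt(a^2 + a^2) = sqrt(1.693147^2 + 1.693147^2) = sqrt(5.733494), which is about 2.394472. With the default norm='l2', each entry is divided by this norm, and 1.693147 / 2.394472 is approximately 0.707107.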
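The normalization step can also be checked numerically. This is a small sketch in plain Python; `l2_normalize` is a hypothetical helper written for illustration, not a scikit-learn function.

```python
import math

def l2_normalize(row):
    """Divide each entry by the row's Euclidean (L2) norm."""
    norm = math.sqrt(sum(x * x for x in row))
    # Leave an all-zero row unchanged to avoid dividing by zero.
    return [x / norm if norm else 0.0 for x in row]

a = math.log(6 / 3) + 1           # smoothed idf, about 1.693147
first_row = [a, 0.0, a, 0.0]      # "good movie" and "movie" are present
normalized = l2_normalize(first_row)
print([round(x, 6) for x in normalized])  # [0.707107, 0.0, 0.707107, 0.0]
```

Because the two nonzero entries are equal, the result is 1/sqrt(2) whatever the idf value is. By the same reasoning, the second row, which has three equal nonzero entries, normalizes to 1/sqrt(3), about 0.57735, per entry.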
See the scikit-learn documentation for details.