How to see the tf-idf score for each word

I am trying to get the tf-idf score of each word in my documents. However, the transform only returns values in a sparse matrix, whereas I want to see each tf-idf score reported against its specific word.

I have worked through the code and it runs, but I want to change how the result is presented:

Code:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

# Build a bag-of-words representation of the comments.
bow_transformer = CountVectorizer(analyzer=text_process).fit(df["comments"].head())
print(len(bow_transformer.vocabulary_))

message_bow = bow_transformer.transform(df["comments"].head())

# Fit a tf-idf transformer on the bag-of-words counts.
tfidf_transformer = TfidfTransformer().fit(message_bow)
message_tfidf = tfidf_transformer.transform(message_bow)

I get results like (39028, 01), (1393, 1672). However, I expect results like this:

features    tfidf
fruit       0.00344
excellent   0.00289
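As background, the (index, value) pairs described above are simply how a scipy sparse matrix prints. Below is a minimal sketch on a toy corpus (a stand-in for the unshown df["comments"] data) showing that raw sparse output next to a per-word view:

from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus; stands in for the question's actual data, which is not shown.
corpus = ["the fruit was excellent", "excellent fruit, bad service"]

vect = TfidfVectorizer()
tfidf_matrix = vect.fit_transform(corpus)

# Printing a sparse matrix yields (row, column) index pairs with values,
# which is the kind of output described in the question.
print(tfidf_matrix)

# Mapping column indices back through the vocabulary gives per-word scores.
for word, idx in sorted(vect.vocabulary_.items()):
    print(word, tfidf_matrix[0, idx])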

Answer:

You can achieve the result above with the following code:

def extract_topn_from_vector(feature_names, sorted_items, topn=5):
    """
    Get the feature names and tf-idf scores of the top n items
    in the doc, in descending order of scores.
    """
    # Use only the top n items from the vector.
    sorted_items = sorted_items[:topn]

    results = {}
    # Map each word index to its corresponding tf-idf score.
    for idx, score in sorted_items:
        results[feature_names[idx]] = round(score, 3)

    # Return a sorted list of (feature name, tf-idf score) tuples,
    # in descending order of tf-idf scores.
    return sorted(results.items(), key=lambda kv: kv[1], reverse=True)


feature_names = count_vect.get_feature_names()

coo_matrix = message_tfidf.tocoo()
tuples = zip(coo_matrix.col, coo_matrix.data)
sorted_items = sorted(tuples, key=lambda x: (x[1], x[0]), reverse=True)

# Extract only the top n elements. Here, n is 10.
word_tfidf = extract_topn_from_vector(feature_names, sorted_items, 10)

print("{}  {}".format("features", "tfidf"))
for k in word_tfidf:
    print("{} - {}".format(k[0], k[1]))
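To see why zipping coo_matrix.col with coo_matrix.data works, here is a minimal, self-contained sketch (the toy matrix values are made up for illustration, not taken from the question's data):

import numpy as np
from scipy.sparse import csr_matrix

# A toy 1 x 5 tf-idf row with three non-zero entries.
row = csr_matrix(np.array([[0.0, 0.52, 0.0, 0.31, 0.17]]))

coo = row.tocoo()
# COO format stores parallel arrays of column indices and values,
# so zipping them yields (column index, tf-idf score) pairs.
for col, value in zip(coo.col, coo.data):
    print(col, value)
# 1 0.52
# 3 0.31
# 4 0.17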

See the full code below to better understand the snippet above; it is largely self-explanatory.

Full code:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from nltk.corpus import stopwords
import re
import string
import nltk
import pandas as pd

data = pd.read_csv('yourfile.csv')

stops = set(stopwords.words("english"))
wl = nltk.WordNetLemmatizer()


def clean_text(text):
    """
    - Remove punctuation
    - Tokenize
    - Remove stopwords
    - Lemmatize
    """
    text_nopunct = "".join([char for char in text if char not in string.punctuation])
    tokens = re.split(r"\W+", text_nopunct)
    text = [word for word in tokens if word not in stops]
    text = [wl.lemmatize(word) for word in text]
    return text


def extract_topn_from_vector(feature_names, sorted_items, topn=5):
    """
    Get the feature names and tf-idf scores of the top n items
    in the doc, in descending order of scores.
    """
    # Use only the top n items from the vector.
    sorted_items = sorted_items[:topn]

    results = {}
    # Map each word index to its corresponding tf-idf score.
    for idx, score in sorted_items:
        results[feature_names[idx]] = round(score, 3)

    # Return a sorted list of (feature name, tf-idf score) tuples,
    # in descending order of tf-idf scores.
    return sorted(results.items(), key=lambda kv: kv[1], reverse=True)


count_vect = CountVectorizer(analyzer=clean_text, tokenizer=None,
                             preprocessor=None, stop_words=None,
                             max_features=5000)
freq_term_matrix = count_vect.fit_transform(data['text_body'])

tfidf = TfidfTransformer(norm="l2")
tfidf.fit(freq_term_matrix)

feature_names = count_vect.get_feature_names()

# sample document
doc = 'watched horrid thing TV. Needless say one movies watch see much worse get.'

tf_idf_vector = tfidf.transform(count_vect.transform([doc]))

coo_matrix = tf_idf_vector.tocoo()
tuples = zip(coo_matrix.col, coo_matrix.data)
sorted_items = sorted(tuples, key=lambda x: (x[1], x[0]), reverse=True)

# Extract only the top n elements. Here, n is 10.
word_tfidf = extract_topn_from_vector(feature_names, sorted_items, 10)

print("{}  {}".format("features", "tfidf"))
for k in word_tfidf:
    print("{} - {}".format(k[0], k[1]))

Sample output:

features  tfidf
Needless - 0.515
horrid - 0.501
worse - 0.312
watched - 0.275
TV - 0.272
say - 0.202
watch - 0.199
thing - 0.189
much - 0.177
see - 0.164
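One caveat for newer environments: on recent scikit-learn releases, get_feature_names() was deprecated in 1.0 and removed in 1.2 in favor of get_feature_names_out(). A minimal sketch of the equivalent setup there (clean_text and data are the same objects as in the full code above; TfidfVectorizer simply fuses CountVectorizer and TfidfTransformer):

from sklearn.feature_extraction.text import TfidfVectorizer

# TfidfVectorizer = CountVectorizer + TfidfTransformer in a single estimator.
vect = TfidfVectorizer(analyzer=clean_text, max_features=5000, norm="l2")
tf_idf_matrix = vect.fit_transform(data['text_body'])

# get_feature_names() is gone in scikit-learn >= 1.2;
# get_feature_names_out() is the replacement.
feature_names = vect.get_feature_names_out()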
