I am trying to get the tf-idf score of each word in a document. However, the code only returns the values in the matrix, whereas I want to see the tf-idf score attached to each specific word.
I have already worked through the code and it runs, but I want to change how the result is presented:
Code:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

bow_transformer = CountVectorizer(analyzer=text_process).fit(df["comments"].head())
print(len(bow_transformer.vocabulary_))
message_bow = bow_transformer.transform(df["comments"].head())
tfidf_transformer = TfidfTransformer().fit(message_bow)
message_tfidf = tfidf_transformer.transform(message_bow)
The result I get looks like (39028, 01), (1393, 1672). However, what I expect is something like this:
features    tfidf
fruit       0.00344
excellent   0.00289
Answer:
You can get the result above with the following code:
def extract_topn_from_vector(feature_names, sorted_items, topn=5):
    """
    Get the feature names and tf-idf scores of the top n items
    in the doc, in descending order of scores.
    """
    # Use only the top n items from the vector.
    sorted_items = sorted_items[:topn]

    results = {}
    # Word index and corresponding tf-idf score.
    for idx, score in sorted_items:
        results[feature_names[idx]] = round(score, 3)

    # Return a sorted list of tuples (feature name, tf-idf score),
    # in descending order of tf-idf scores.
    return sorted(results.items(), key=lambda kv: kv[1], reverse=True)

feature_names = count_vect.get_feature_names()
coo_matrix = message_tfidf.tocoo()
tuples = zip(coo_matrix.col, coo_matrix.data)
sorted_items = sorted(tuples, key=lambda x: (x[1], x[0]), reverse=True)

# Extract only the top n elements. Here, n is 10.
word_tfidf = extract_topn_from_vector(feature_names, sorted_items, 10)

print("{} {}".format("features", "tfidf"))
for k in word_tfidf:
    print("{} - {}".format(k[0], k[1]))
See the full code below to better understand the snippet above; it is largely self-explanatory.
Full code:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from nltk.corpus import stopwords
import string
import re
import nltk
import pandas as pd

data = pd.read_csv('yourfile.csv')
stops = set(stopwords.words("english"))
wl = nltk.WordNetLemmatizer()

def clean_text(text):
    """
    - Remove punctuation
    - Tokenize
    - Remove stopwords
    - Lemmatize
    """
    text_nopunct = "".join([char for char in text if char not in string.punctuation])
    tokens = re.split(r"\W+", text_nopunct)
    text = [word for word in tokens if word not in stops]
    text = [wl.lemmatize(word) for word in text]
    return text

def extract_topn_from_vector(feature_names, sorted_items, topn=5):
    """
    Get the feature names and tf-idf scores of the top n items
    in the doc, in descending order of scores.
    """
    # Use only the top n items from the vector.
    sorted_items = sorted_items[:topn]

    results = {}
    # Word index and corresponding tf-idf score.
    for idx, score in sorted_items:
        results[feature_names[idx]] = round(score, 3)

    # Return a sorted list of tuples (feature name, tf-idf score),
    # in descending order of tf-idf scores.
    return sorted(results.items(), key=lambda kv: kv[1], reverse=True)

count_vect = CountVectorizer(analyzer=clean_text,
                             tokenizer=None,
                             preprocessor=None,
                             stop_words=None,
                             max_features=5000)
freq_term_matrix = count_vect.fit_transform(data['text_body'])

tfidf = TfidfTransformer(norm="l2")
tfidf.fit(freq_term_matrix)

feature_names = count_vect.get_feature_names()

# Sample document.
doc = 'watched horrid thing TV. Needless say one movies watch see much worse get.'

tf_idf_vector = tfidf.transform(count_vect.transform([doc]))
coo_matrix = tf_idf_vector.tocoo()
tuples = zip(coo_matrix.col, coo_matrix.data)
sorted_items = sorted(tuples, key=lambda x: (x[1], x[0]), reverse=True)

# Extract only the top n elements. Here, n is 10.
word_tfidf = extract_topn_from_vector(feature_names, sorted_items, 10)

print("{} {}".format("features", "tfidf"))
for k in word_tfidf:
    print("{} - {}".format(k[0], k[1]))
Sample output:
features tfidf
Needless - 0.515
horrid - 0.501
worse - 0.312
watched - 0.275
TV - 0.272
say - 0.202
watch - 0.199
thing - 0.189
much - 0.177
see - 0.164