How to view the tf-idf score of each word

I am trying to find out the tf-idf score of each word in my documents. However, the transform only returns values in a sparse matrix, whereas what I want to see is the tf-idf score attached to each specific word.

I have already worked through the code and it runs, but I would like to change how the result is presented:

Code:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

bow_transformer = CountVectorizer(analyzer=text_process).fit(df["comments"].head())
print(len(bow_transformer.vocabulary_))

message_bow = bow_transformer.transform(messages['message'])
tfidf_transformer = TfidfTransformer().fit(message_bow)
message_tfidf = tfidf_transformer.transform(message_bow)

The result I get looks like (39028, 01), (1393, 1672). What I expected, however, is something like this:

features    tfidf
fruit       0.00344
excellent   0.00289

Answer:

You can get the result above with the following code:

def extract_topn_from_vector(feature_names, sorted_items, topn=5):
    """
    Get the feature names and tf-idf scores of the top n items
    in the document, in descending order of score.
    """
    # use only the top n items from the vector
    sorted_items = sorted_items[:topn]
    results = {}
    # word index and corresponding tf-idf score
    for idx, score in sorted_items:
        results[feature_names[idx]] = round(score, 3)
    # return a sorted list of (feature name, tf-idf score) tuples,
    # in descending order of tf-idf score
    return sorted(results.items(), key=lambda kv: kv[1], reverse=True)

feature_names = count_vect.get_feature_names()
coo_matrix = message_tfidf.tocoo()
tuples = zip(coo_matrix.col, coo_matrix.data)
sorted_items = sorted(tuples, key=lambda x: (x[1], x[0]), reverse=True)

# extract only the top n elements; here, n is 10
word_tfidf = extract_topn_from_vector(feature_names, sorted_items, 10)

print("{}  {}".format("features", "tfidf"))
for k in word_tfidf:
    print("{} - {}".format(k[0], k[1]))

See the full code below for a better understanding of the snippet above; it is largely self-explanatory.

Full code:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from nltk.corpus import stopwords
import string
import re
import nltk
import pandas as pd

data = pd.read_csv('yourfile.csv')
stops = set(stopwords.words("english"))
wl = nltk.WordNetLemmatizer()

def clean_text(text):
    """
    - remove punctuation
    - tokenize
    - remove stopwords
    - lemmatize
    """
    text_nopunct = "".join([char for char in text if char not in string.punctuation])
    tokens = re.split(r"\W+", text_nopunct)
    text = [word for word in tokens if word not in stops]
    text = [wl.lemmatize(word) for word in text]
    return text

def extract_topn_from_vector(feature_names, sorted_items, topn=5):
    """
    Get the feature names and tf-idf scores of the top n items
    in the document, in descending order of score.
    """
    # use only the top n items from the vector
    sorted_items = sorted_items[:topn]
    results = {}
    # word index and corresponding tf-idf score
    for idx, score in sorted_items:
        results[feature_names[idx]] = round(score, 3)
    # return a sorted list of (feature name, tf-idf score) tuples,
    # in descending order of tf-idf score
    return sorted(results.items(), key=lambda kv: kv[1], reverse=True)

count_vect = CountVectorizer(analyzer=clean_text, tokenizer=None, preprocessor=None,
                             stop_words=None, max_features=5000)
freq_term_matrix = count_vect.fit_transform(data['text_body'])
tfidf = TfidfTransformer(norm="l2")
tfidf.fit(freq_term_matrix)
feature_names = count_vect.get_feature_names()

# sample document
doc = 'watched horrid thing TV. Needless say one movies watch see much worse get.'

tf_idf_vector = tfidf.transform(count_vect.transform([doc]))
coo_matrix = tf_idf_vector.tocoo()
tuples = zip(coo_matrix.col, coo_matrix.data)
sorted_items = sorted(tuples, key=lambda x: (x[1], x[0]), reverse=True)

# extract only the top n elements; here, n is 10
word_tfidf = extract_topn_from_vector(feature_names, sorted_items, 10)

print("{}  {}".format("features", "tfidf"))
for k in word_tfidf:
    print("{} - {}".format(k[0], k[1]))

Sample output:

features  tfidf
Needless - 0.515
horrid - 0.501
worse - 0.312
watched - 0.275
TV - 0.272
say - 0.202
watch - 0.199
thing - 0.189
much - 0.177
see - 0.164
