I am trying to get the tf-idf score of each word in a document. However, the code only returns the values in the matrix, whereas I want to see the tf-idf score attached to each specific word.
I have already worked through the code and it runs, but I want to change how the result is presented:
Code:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

bow_transformer = CountVectorizer(analyzer=text_process).fit(df["comments"].head())
print(len(bow_transformer.vocabulary_))
message_bow = bow_transformer.transform(df["comments"].head())
tfidf_transformer = TfidfTransformer().fit(message_bow)
message_tfidf = tfidf_transformer.transform(message_bow)
The result I get looks like (39028, 01), (1393, 1672). However, what I expect is something like this:
features    tfidf
fruit       0.00344
excellent   0.00289
Answer:
You can get the result above with the following code:
def extract_topn_from_vector(feature_names, sorted_items, topn=5):
    """
    Get the feature names and tf-idf scores of the top n items
    in the doc, in descending order of scores.
    """
    # Use only the top n items from the vector.
    sorted_items = sorted_items[:topn]

    results = {}
    # Word index and corresponding tf-idf score.
    for idx, score in sorted_items:
        results[feature_names[idx]] = round(score, 3)

    # Return a sorted list of tuples (feature name, tf-idf score),
    # in descending order of tf-idf scores.
    return sorted(results.items(), key=lambda kv: kv[1], reverse=True)

feature_names = count_vect.get_feature_names()
coo_matrix = message_tfidf.tocoo()
tuples = zip(coo_matrix.col, coo_matrix.data)
sorted_items = sorted(tuples, key=lambda x: (x[1], x[0]), reverse=True)

# Extract only the top n elements. Here, n is 10.
word_tfidf = extract_topn_from_vector(feature_names, sorted_items, 10)

print("{} {}".format("features", "tfidf"))
for k in word_tfidf:
    print("{} - {}".format(k[0], k[1]))
See the full code below to better understand the snippet above; it is largely self-explanatory.
Full code:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from nltk.corpus import stopwords
import string
import re
import nltk
import pandas as pd

data = pd.read_csv('yourfile.csv')
stops = set(stopwords.words("english"))
wl = nltk.WordNetLemmatizer()

def clean_text(text):
    """
    - Remove punctuation
    - Tokenize
    - Remove stopwords
    - Lemmatize
    """
    text_nopunct = "".join([char for char in text if char not in string.punctuation])
    tokens = re.split(r"\W+", text_nopunct)
    text = [word for word in tokens if word not in stops]
    text = [wl.lemmatize(word) for word in text]
    return text

def extract_topn_from_vector(feature_names, sorted_items, topn=5):
    """
    Get the feature names and tf-idf scores of the top n items
    in the doc, in descending order of scores.
    """
    # Use only the top n items from the vector.
    sorted_items = sorted_items[:topn]

    results = {}
    # Word index and corresponding tf-idf score.
    for idx, score in sorted_items:
        results[feature_names[idx]] = round(score, 3)

    # Return a sorted list of tuples (feature name, tf-idf score),
    # in descending order of tf-idf scores.
    return sorted(results.items(), key=lambda kv: kv[1], reverse=True)

count_vect = CountVectorizer(analyzer=clean_text,
                             tokenizer=None,
                             preprocessor=None,
                             stop_words=None,
                             max_features=5000)
freq_term_matrix = count_vect.fit_transform(data['text_body'])

tfidf = TfidfTransformer(norm="l2")
tfidf.fit(freq_term_matrix)

feature_names = count_vect.get_feature_names()

# Sample document.
doc = 'watched horrid thing TV. Needless say one movies watch see much worse get.'

tf_idf_vector = tfidf.transform(count_vect.transform([doc]))
coo_matrix = tf_idf_vector.tocoo()
tuples = zip(coo_matrix.col, coo_matrix.data)
sorted_items = sorted(tuples, key=lambda x: (x[1], x[0]), reverse=True)

# Extract only the top n elements. Here, n is 10.
word_tfidf = extract_topn_from_vector(feature_names, sorted_items, 10)

print("{} {}".format("features", "tfidf"))
for k in word_tfidf:
    print("{} - {}".format(k[0], k[1]))
Sample output:
features tfidf
Needless - 0.515
horrid - 0.501
worse - 0.312
watched - 0.275
TV - 0.272
say - 0.202
watch - 0.199
thing - 0.189
much - 0.177
see - 0.164