I'm working on a function like the one below:
import ast

import nltk
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import TfidfVectorizer

def get_feature_name_by_tfidf(text_to_process):
    # master_path is defined elsewhere in the project
    with open(master_path + '\\additional_stopwords.txt', 'r') as f:
        additional_stop_words = ast.literal_eval(f.read())
    stop_words = text.ENGLISH_STOP_WORDS.union(set(additional_stop_words))
    tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 4), min_df=0, stop_words=stop_words)
    tfidf_matrix = tf.fit_transform(text_to_process.split(','))
    tagged = nltk.pos_tag(tf.get_feature_names())
    # Drop features tagged as VBP (verb, non-3rd person singular present)
    feature_names_with_tags = {k: v for k, v in dict(tagged).items() if v != 'VBP'}
    return list(feature_names_with_tags.keys())
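The function returns a list of keywords from the text passed in. Is there a way to keep the keywords in the same form (casing) as in the input? For example, if the string passed in is:
Input: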
a = "TIME is the company where I work"
Instead of getting a keyword list like this:
['time', 'company']
I would like to get:
['TIME', 'company']
Answer:
By default, TfidfVectorizer converts all tokens to lowercase before building the vocabulary. Pass lowercase=False to turn this off:
tf = TfidfVectorizer(analyzer='word', lowercase=False, ngram_range=(1, 4), min_df=0, stop_words=stop_words)
That should solve the problem. For details, see the TfidfVectorizer documentation.
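
To see the effect of the lowercase parameter in isolation, here is a minimal sketch. It assumes a small hand-written stop word list (not the asker's additional_stopwords.txt) and skips the n-gram and POS-tagging steps, so it only illustrates the casing behaviour:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

sample = "TIME is the company where I work"

# Hypothetical stop word list, used only for this illustration.
stop_words = ["is", "the", "where", "work"]

# Default behaviour: tokens are lowercased before the vocabulary is built.
default_tf = TfidfVectorizer(analyzer="word", stop_words=stop_words)
default_tf.fit([sample])
default_features = sorted(default_tf.vocabulary_)

# lowercase=False: tokens keep the casing they had in the input.
cased_tf = TfidfVectorizer(analyzer="word", lowercase=False, stop_words=stop_words)
cased_tf.fit([sample])
cased_features = sorted(cased_tf.vocabulary_)

print(default_features)  # ['company', 'time']
print(cased_features)    # ['TIME', 'company']
```

One thing to watch for: with lowercase=False, stop words are matched against the tokens exactly as they appear, so a capitalized stop word such as "The" at the start of a sentence is no longer filtered out by a lowercase stop word list. The single-letter "I" is absent in both results because the default token pattern only keeps tokens of two or more characters.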