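I am running LDA on text data, following the example here. My question is:
How do I know which documents correspond to which topic? In other words, what are the documents about, say, topic 1 actually discussing?
Here are my steps: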

    n_features = 1000
    n_topics = 8
    n_top_words = 20

I read my text file line by line:

    with open('dataset.txt', 'r') as data_file:
        input_lines = [line.strip() for line in data_file.readlines()]

    mydata = [line for line in input_lines]

A function to print the topics:

    def print_top_words(model, feature_names, n_top_words):
        # For each topic, print the n_top_words highest-weighted terms
        for topic_idx, topic in enumerate(model.components_):
            print("Topic #%d:" % topic_idx)
            print(" ".join([feature_names[i]
                            for i in topic.argsort()[:-n_top_words - 1:-1]]))
        print()

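To see why that slice picks the top words, here is a minimal sketch with NumPy. The `weights` array is made up for illustration and stands in for one row of `model.components_`.

```python
import numpy as np

# Made-up weights standing in for one row of model.components_
weights = np.array([0.1, 0.7, 0.2, 0.9, 0.4])
n_top = 3

# argsort() orders indices from smallest to largest weight;
# the slice [:-n_top - 1:-1] walks backwards from the end and
# keeps the last n_top indices, so the largest weights come first.
top_idx = weights.argsort()[:-n_top - 1:-1]
print(top_idx.tolist())
```

The indices come back ordered from the heaviest weight down, which is why the printed words for each topic run from most to least characteristic.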
Vectorize the data:

    from sklearn.feature_extraction.text import CountVectorizer

    tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
                                    token_pattern='\\b\\w{2,}\\w+\\b',
                                    max_features=n_features,
                                    stop_words='english')
    tf = tf_vectorizer.fit_transform(mydata)

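As a quick check on what that `token_pattern` keeps, here is a small sketch using only the standard `re` module. The sample sentence is made up, and the sketch assumes the vectorizer's default behaviour of lowercasing the text before tokenizing; stop-word removal is not reproduced here.

```python
import re

# Same regex as the token_pattern above, written as a raw string
TOKEN_PATTERN = r'\b\w{2,}\w+\b'

# Made-up sample text, lowercased as CountVectorizer does by default
sample = "An LED lamp is powered by 2 AA batteries"
tokens = re.findall(TOKEN_PATTERN, sample.lower())

# \w{2,} followed by \w+ needs at least three word characters,
# so one- and two-character tokens are dropped.
print(tokens)
```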
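Initialize LDA:

    from sklearn.decomposition import LatentDirichletAllocation

    # n_topics is the older parameter name; newer scikit-learn
    # releases rename it to n_components
    lda = LatentDirichletAllocation(n_topics=3, max_iter=5,
                                    learning_method='online',
                                    learning_offset=50.,
                                    random_state=0)
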
Run LDA on the tf data:

    lda.fit(tf)

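Print the results using the function above:

    print("\nTopics in LDA model:")
    # get_feature_names() is the older method name; newer scikit-learn
    # releases use get_feature_names_out()
    tf_feature_names = tf_vectorizer.get_feature_names()
    print_top_words(lda, tf_feature_names, n_top_words)
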
The printed output is:

    Topics in LDA model:
    Topic #0:
    solar road body lamp power battery energy beacon
    Topic #1:
    skin cosmetic hair extract dermatological aging production active
    Topic #2:
    cosmetic oil water agent block emulsion ingredients mixture

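Answer:
You need to transform the data. The result has one row per document and one column per topic, holding that document's weight for each topic:

    doc_topic = lda.transform(tf)
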
Then list each document alongside its highest-scoring topic:

    for n in range(doc_topic.shape[0]):
        topic_most_pr = doc_topic[n].argmax()
        print("doc: {} topic: {}\n".format(n, topic_most_pr))
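
To read what the documents under a given topic are about, it can help to group them by dominant topic and look at the strongest ones first. The following is a rough sketch, not part of the original answer: `docs_by_topic` is a hypothetical helper, and the toy weights and texts are made up to stand in for `lda.transform(tf)` and `mydata`.

```python
import numpy as np

def docs_by_topic(doc_topic, documents, top_k=3):
    """Group documents under their highest-weighted topic, strongest first."""
    doc_topic = np.asarray(doc_topic, dtype=float)
    dominant = doc_topic.argmax(axis=1)  # dominant topic index per document
    grouped = {}
    for topic_idx in range(doc_topic.shape[1]):
        members = np.flatnonzero(dominant == topic_idx)
        # Order this topic's documents by descending weight
        order = members[np.argsort(-doc_topic[members, topic_idx])]
        grouped[topic_idx] = [
            (int(i), documents[i], float(doc_topic[i, topic_idx]))
            for i in order[:top_k]
        ]
    return grouped

# Made-up values standing in for lda.transform(tf) and mydata
toy_doc_topic = [
    [0.80, 0.10, 0.10],
    [0.05, 0.90, 0.05],
    [0.60, 0.30, 0.10],
    [0.20, 0.20, 0.60],
]
toy_docs = ["solar road lamp", "skin aging extract",
            "battery power beacon", "oil water emulsion"]

result = docs_by_topic(toy_doc_topic, toy_docs)
for topic_idx, members in result.items():
    print("Topic #%d:" % topic_idx)
    for doc_idx, text, weight in members:
        print("  doc %d (%.2f): %s" % (doc_idx, weight, text))
```

With the real model, the call would be `docs_by_topic(doc_topic, mydata)`. Taking only the argmax discards the rest of each row, so a document with nearly equal weights across topics still lands in a single group; inspecting the weight alongside the text makes such cases visible.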