I trained an LDA model with PySpark (Spark 2.1.1) on a dataset of customer reviews. Now I want to use that model to predict the topics of new, unseen text.
I create the model with the following code:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext, Row
from pyspark.ml.feature import HashingTF, IDF, Tokenizer, CountVectorizer, StopWordsRemover
from pyspark.mllib.clustering import LDA, LDAModel
from pyspark.ml.clustering import DistributedLDAModel, LocalLDAModel
from pyspark.mllib.linalg import Vector, Vectors
from pyspark.sql.functions import *
import pyspark.sql.functions as F

path = "D:/sparkdata/sample_text_LDA.txt"
sc = SparkContext("local[*]", "review")
spark = SparkSession.builder.appName('Basics').getOrCreate()

# Read the reviews and turn each one into a Row(idd, words)
df = spark.read.csv("D:/sparkdata/customers_data.csv", header=True, inferSchema=True)
data = (df.select("Reviews").rdd
        .map(list)
        .map(lambda x: x[0])
        .zipWithIndex()
        .map(lambda words: Row(idd=words[1], words=words[0].split(" ")))
        .collect())
docDF = spark.createDataFrame(data)

# Remove stop words and build term-count vectors
remover = StopWordsRemover(inputCol="words", outputCol="stopWordsRemoved")
stopWordsRemoved_df = remover.transform(docDF).cache()
# renamed from "Vector" so it no longer shadows pyspark.mllib.linalg.Vector
vectorizer = CountVectorizer(inputCol="stopWordsRemoved", outputCol="vectors")
cvModel = vectorizer.fit(stopWordsRemoved_df)
result = cvModel.transform(stopWordsRemoved_df)
corpus = result.select("idd", "vectors").rdd.map(lambda x: [x[0], Vectors.fromML(x[1])]).cache()

# Cluster the documents into topics using LDA
ldaModel = LDA.train(corpus, k=3, maxIterations=100, optimizer='online')
topics = ldaModel.topicsMatrix()
vocabArray = cvModel.vocabulary
print(ldaModel.describeTopics())

# Map each topic's top term ids back to actual words
wordNumbers = 10  # number of words per topic
topicIndices = sc.parallelize(ldaModel.describeTopics(maxTermsPerTopic=wordNumbers))

def topic_render(topic):
    terms = topic[0]
    result = []
    for i in range(wordNumbers):
        result.append(vocabArray[terms[i]])
    return result

topics_final = topicIndices.map(lambda topic: topic_render(topic)).collect()
for topic in range(len(topics_final)):
    print("Topic" + str(topic) + ":")
    for term in topics_final[topic]:
        print(term)
    print('\n')
Now I have a dataframe with new customer reviews and I want to predict which topic cluster they belong to. I searched for answers, and most of them recommend the approach below; see Spark MLlib LDA, how to infer the topics distribution of a new unseen document?.
newDocuments: RDD[(Long, Vector)] = ...
topicDistributions = distLDA.toLocal.topicDistributions(newDocuments)
However, I get the following error:
'LDAModel' object has no attribute 'toLocal'. Nor does it have a topicDistributions attribute.
So are these attributes not supported in Spark 2.1.1?
Is there some other way to infer topics from unseen data?
Answer:
You need to preprocess the new data:
import pandas as pd
import gensim

# Import a new dataset to run through the pre-trained LDA model
data_new = pd.read_csv('YourNew.csv', encoding="ISO-8859-1")
data_new = data_new.dropna()
data_text_new = data_new[['Your Target Column']]
data_text_new['index'] = data_text_new.index
documents_new = data_text_new
#documents_new = documents.dropna(subset=['Preprocessed Document'])

# Run the new dataset through the lemmatization and stop-word removal function
# (the dataframe only has 'Your Target Column', so map preprocess over it)
processed_docs_new = documents_new['Your Target Column'].map(preprocess)

# Create a dictionary of individual words and filter the dictionary
# (note: for topic inference to be meaningful, word ids must match the
# training dictionary; in practice reuse the training dictionary's doc2bow
# rather than building a fresh one)
dictionary_new = gensim.corpora.Dictionary(processed_docs_new[:])
dictionary_new.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)

# Define the bag-of-words corpus
bow_corpus_new = [dictionary_new.doc2bow(doc) for doc in processed_docs_new]
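The snippet above calls a preprocess function that isn't shown. A minimal sketch of what it might look like, assuming gensim's simple_preprocess and STOPWORDS plus NLTK's WordNetLemmatizer (the helper itself is illustrative, not part of the original answer):

import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer  # requires nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()

def preprocess(text):
    # Tokenize and lowercase, drop stop words and very short tokens, then lemmatize
    return [lemmatizer.lemmatize(token)
            for token in simple_preprocess(str(text))
            if token not in STOPWORDS and len(token) > 3]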
Then you can simply pass it through the trained LDA model; all you need is that bag-of-words:
ldamodel[bow_corpus_new[:len(bow_corpus_new)]]
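For reference, indexing a gensim LdaModel with a bag-of-words corpus yields, per document, a list of (topic_id, probability) pairs, and topics below the model's minimum_probability are omitted. To inspect a single document and force every topic to be reported, you can use get_document_topics (the printed pairs below are a hypothetical example):

print(ldamodel.get_document_topics(bow_corpus_new[0], minimum_probability=0.0))
# hypothetical output: [(0, 0.12), (1, 0.71), (2, 0.17)]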
If you want the output in a csv file, try something like this:
# 'a' holds, for each document, a list of (topic_id, probability) pairs
a = ldamodel[bow_corpus_new[:len(bow_corpus_new)]]
b = data_text_new
topic_0 = []
topic_1 = []
topic_2 = []
for i in a:
    topic_0.append(i[0][1])
    topic_1.append(i[1][1])
    topic_2.append(i[2][1])

# Assemble one column per topic next to the original text column
d = {'Your Target Column': b['Your Target Column'].tolist(),
     'topic_0': topic_0,
     'topic_1': topic_1,
     'topic_2': topic_2}
df = pd.DataFrame(data=d)
df.to_csv("YourAllocated.csv", index=True, mode='a')
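Note that the positional indexing above (i[0][1], i[1][1], i[2][1]) assumes every document reports all three topics in order, which only holds when low-probability topics aren't filtered out. A more defensive variant (a sketch using the same variable names) converts each document's pairs to a dict first:

topic_rows = []
for doc_topics in ldamodel[bow_corpus_new]:
    probs = dict(doc_topics)              # {topic_id: probability}
    topic_rows.append([probs.get(t, 0.0)  # 0.0 for topics gensim filtered out
                       for t in range(3)])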
Hope this helps :)