Spark 2.1.1:如何在已训练的Spark 2.1.1 LDA模型上预测未见文档的主题?

我在pyspark(Spark 2.1.1)上使用客户评论数据集训练了一个LDA模型。现在我想基于该模型预测新未见文本中的主题。

我使用以下代码创建模型

from pyspark import SparkConf, SparkContextfrom pyspark.sql import SparkSessionfrom pyspark.sql import SQLContext, Rowfrom pyspark.ml.feature import CountVectorizerfrom pyspark.ml.feature import HashingTF, IDF, Tokenizer, CountVectorizer, StopWordsRemoverfrom pyspark.mllib.clustering import LDA, LDAModelfrom pyspark.ml.clustering import DistributedLDAModel, LocalLDAModelfrom pyspark.mllib.linalg import Vector, Vectorsfrom pyspark.sql.functions import *import pyspark.sql.functions as Fpath = "D:/sparkdata/sample_text_LDA.txt"sc = SparkContext("local[*]", "review")spark = SparkSession.builder.appName('Basics').getOrCreate()df = spark.read.csv("D:/sparkdata/customers_data.csv", header=True, inferSchema=True)data = df.select("Reviews").rdd.map(list).map(lambda x: x[0]).zipWithIndex().map(lambda words: Row(idd= words[1], words = words[0].split(" "))).collect()docDF = spark.createDataFrame(data)remover = StopWordsRemover(inputCol="words",outputCol="stopWordsRemoved")stopWordsRemoved_df = remover.transform(docDF).cache()Vector = CountVectorizer(inputCol="stopWordsRemoved", outputCol="vectors")model = Vector.fit(stopWordsRemoved_df)result = model.transform(stopWordsRemoved_df)corpus = result.select("idd", "vectors").rdd.map(lambda x: [x[0],Vectors.fromML(x[1])]).cache()# Cluster the documents topics using LDAldaModel = LDA.train(corpus, k=3,maxIterations=100,optimizer='online')topics = ldaModel.topicsMatrix()vocabArray = model.vocabularyprint(ldaModel.describeTopics())wordNumbers = 10  # number of words per topictopicIndices = sc.parallelize(ldaModel.describeTopics(maxTermsPerTopic = wordNumbers))def topic_render(topic):  # specify vector id of words to actual words   terms = topic[0]   result = []   for i in range(wordNumbers):       term = vocabArray[terms[i]]       result.append(term)   return resulttopics_final = topicIndices.map(lambda topic: topic_render(topic)).collect()for topic in range(len(topics_final)):   print("Topic" + str(topic) + ":")   for term in topics_final[topic]:       print (term)   print ('\n')

现在我有一个包含新客户评论的数据框,我希望预测它们属于哪个主题集群。我已经搜索了一些答案,大多数推荐了以下方法,参见Spark MLlib LDA, how to infer the topics distribution of a new unseen document?

newDocuments: RDD[(Long, Vector)] = ...topicDistributions = distLDA.toLocal.topicDistributions(newDocuments)

然而,我得到了以下错误:

‘LDAModel’ 对象没有属性 ‘toLocal’。也没有 topicDistribution 属性。

所以这些属性在Spark 2.1.1中不被支持吗?

那么有没有其他方法从未见数据中推断主题?


回答:

你需要对新数据进行预处理:

# 导入一个新的数据集,通过预训练的LDA模型进行处理data_new = pd.read_csv('YourNew.csv', encoding = "ISO-8859-1");data_new = data_new.dropna()data_text_new = data_new[['Your Target Column']]data_text_new['index'] = data_text_new.indexdocuments_new = data_text_new#documents_new = documents.dropna(subset=['Preprocessed Document'])# 通过词形还原和停用词函数处理新数据集processed_docs_new = documents_new['Preprocessed Document'].map(preprocess)# 创建单个词的字典并过滤字典dictionary_new = gensim.corpora.Dictionary(processed_docs_new[:])dictionary_new.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)# 定义词袋模型bow_corpus_new = [dictionary_new.doc2bow(doc) for doc in processed_docs_new]

然后你可以将其作为函数传递给训练好的LDA模型。你只需要那个词袋模型:

ldamodel[bow_corpus_new[:len(bow_corpus_new)]]

如果你想将其输出到csv文件中,可以尝试这样做:

a = ldamodel[bow_corpus_new[:len(bow_corpus_new)]]b = data_text_newtopic_0=[]topic_1=[]topic_2=[]for i in a:    topic_0.append(i[0][1])    topic_1.append(i[1][1])    topic_2.append(i[2][1])    d = {'Your Target Column': b['Your Target Column'].tolist(),     'topic_0': topic_0,     'topic_1': topic_1,     'topic_2': topic_2}     df = pd.DataFrame(data=d)df.to_csv("YourAllocated.csv", index=True, mode = 'a')

希望这对你有帮助 🙂

Related Posts

使用LSTM在Python中预测未来值

这段代码可以预测指定股票的当前日期之前的值,但不能预测…

如何在gensim的word2vec模型中查找双词组的相似性

我有一个word2vec模型,假设我使用的是googl…

dask_xgboost.predict 可以工作但无法显示 – 数据必须是一维的

我试图使用 XGBoost 创建模型。 看起来我成功地…

ML Tuning – Cross Validation in Spark

我在https://spark.apache.org/…

如何在React JS中使用fetch从REST API获取预测

我正在开发一个应用程序,其中Flask REST AP…

如何分析ML.NET中多类分类预测得分数组?

我在ML.NET中创建了一个多类分类项目。该项目可以对…

发表回复

您的邮箱地址不会被公开。 必填项已用 * 标注