Can someone explain the unsupported operand error I get when running blei_lda.py from Chapter 4 of Building Machine Learning Systems with Python?

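I have been trying to run the blei_lda.py file from Chapter 4 of Building Machine Learning Systems with Python, without success. I am using Python 2.7 and the Enthought Canopy GUI. Below is the actual file provided by the authors; there are also several copies on GitHub.

GitHub repository

The problem is that I keep getting the following error: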

TypeError                                 Traceback (most recent call last)
c:\users\matt\desktop\pythonprojects\pml\ch04\blei_lda.py in <module>()
    for ti in range(model.num_topics):
        words = model.show_topic(ti, 64)
------> tf = sum(f for f, w in words)
        with open('topics.txt', 'w') as output:
            output.write('\n'.join('{}:{}'.format(w, int(1000. * f / tf)) for f, w in words))
            output.write("\n\n\n")

TypeError: unsupported operand type(s) for +: 'int' and 'unicode'

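I have tried to come up with a fix, but have not found anything that fully works.

I have also searched the web and Stack Overflow for a solution, but I seem to be the only one having trouble running this file.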

# This code is supporting material for the book
# Building Machine Learning Systems with Python
# by Willi Richert and Luis Pedro Coelho
# published by PACKT Publishing
#
# It is made available under the MIT License

from __future__ import print_function
from wordcloud import create_cloud
try:
    from gensim import corpora, models, matutils
except:
    print("import gensim failed.")
    print()
    print("Please install it")
    raise

import matplotlib.pyplot as plt
import numpy as np
from os import path

NUM_TOPICS = 100

# Check that data exists
if not path.exists('./data/ap/ap.dat'):
    print('Error: Expected data to be present at data/ap/')
    print('Please cd into ./data & run ./download_ap.sh')

# Load the data
corpus = corpora.BleiCorpus('./data/ap/ap.dat', './data/ap/vocab.txt')

# Build the topic model
model = models.ldamodel.LdaModel(
    corpus, num_topics=NUM_TOPICS, id2word=corpus.id2word, alpha=None)

# Iterate over all the topics in the model
for ti in range(model.num_topics):
    words = model.show_topic(ti, 64)
    tf = sum(f for f, w in words)
    with open('topics.txt', 'w') as output:
        output.write('\n'.join('{}:{}'.format(w, int(1000. * f / tf)) for f, w in words))
        output.write("\n\n\n")

# We first identify the most discussed topic, i.e., the one with the
# highest total weight
topics = matutils.corpus2dense(model[corpus], num_terms=model.num_topics)
weight = topics.sum(1)
max_topic = weight.argmax()

# Get the top 64 words for this topic
# Without the argument, show_topic would return only 10 words
words = model.show_topic(max_topic, 64)

# This function will actually check for the presence of pytagcloud and is otherwise a no-op
create_cloud('cloud_blei_lda.png', words)

num_topics_used = [len(model[doc]) for doc in corpus]
fig,ax = plt.subplots()
ax.hist(num_topics_used, np.arange(42))
ax.set_ylabel('Nr of documents')
ax.set_xlabel('Nr of topics')
fig.tight_layout()
fig.savefig('Figure_04_01.png')

# Now, repeat the same exercise using alpha=1.0
# You can edit the constant below to play around with this parameter
ALPHA = 1.0

model1 = models.ldamodel.LdaModel(
    corpus, num_topics=NUM_TOPICS, id2word=corpus.id2word, alpha=ALPHA)
num_topics_used1 = [len(model1[doc]) for doc in corpus]

fig,ax = plt.subplots()
ax.hist([num_topics_used, num_topics_used1], np.arange(42))
ax.set_ylabel('Nr of documents')
ax.set_xlabel('Nr of topics')

# The coordinates below were fit by trial and error to look good
ax.text(9, 223, r'default alpha')
ax.text(26, 156, 'alpha=1.0')
fig.tight_layout()
fig.savefig('Figure_04_02.png')

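Answer:

In the line words = model.show_topic(ti, 64), words is a list of tuples of the form (unicode, float64).

For example: [(u'school', 0.029515796999228502), (u'prom', 0.018586355008452897)]

So in the line tf = sum(f for f, w in words), f is bound to the unicode string and w to the float value. sum() starts from the integer 0 and tries to add the first unicode value to it, which raises the unsupported operand type error for 'int' and 'unicode'.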
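The effect of the unpacking order can be reproduced without gensim. The following is a minimal sketch that reuses the sample tuples above; the helper names total_weight_wrong and total_weight_right are made up for illustration. It assumes Python 3, where the message names 'str' instead of 'unicode'.

```python
# Sample data shaped like the output quoted above: (word, weight) pairs.
words = [(u'school', 0.029515796999228502), (u'prom', 0.018586355008452897)]


def total_weight_wrong(pairs):
    # Unpacks each tuple as (f, w), so f is bound to the word string.
    # sum() starts at the integer 0 and fails on 0 + 'school'.
    return sum(f for f, w in pairs)


def total_weight_right(pairs):
    # Unpacks each tuple as (w, f), so f is bound to the float weight.
    return sum(f for w, f in pairs)


try:
    total_weight_wrong(words)
except TypeError as exc:
    # Same kind of failure as in the traceback from the question.
    error_message = str(exc)
else:
    error_message = None

total = total_weight_right(words)
print(error_message)
print(total)
```

Only the order of the names in the generator expression differs between the two helpers; the data is the same.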

Change that line to tf = sum(f for w, f in words) so that it sums the float values.

For the same reason, also change this line to: output.write('\n'.join('{}:{}'.format(w, int(1000. * f / tf)) for w, f in words))

The snippet then looks like this:

for ti in range(model.num_topics):
    words = model.show_topic(ti, 64)
    tf = sum(f for w, f in words)
    with open('topics.txt', 'w') as output:
        output.write('\n'.join('{}:{}'.format(w, int(1000. * f / tf)) for w, f in words))
        output.write("\n\n\n")
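As a possible variation, the sketch below accepts either tuple order, which may help if the order returned by show_topic differs between installations (the book's code expects (weight, word), while the output above is (word, weight)). It also writes to a stream that is opened once: the loop above reopens topics.txt in 'w' mode on every iteration, so each topic overwrites the previous one. The names format_topic and write_topics are hypothetical helpers, not part of gensim or the book's code, and the sample weights are invented so the example runs without a corpus. It assumes Python 3.

```python
import io


def format_topic(pairs):
    """Return 'word:score' lines for one topic, accepting either tuple order."""
    normalized = []
    for a, b in pairs:
        # Whichever element is a number is treated as the weight.
        if isinstance(a, (int, float)):
            weight, word = a, b
        else:
            word, weight = a, b
        normalized.append((word, float(weight)))
    total = sum(weight for _, weight in normalized)
    return '\n'.join('{}:{}'.format(word, int(1000. * weight / total))
                     for word, weight in normalized)


def write_topics(topics, output):
    """Write every topic to an already-open text stream, one block per topic."""
    for pairs in topics:
        output.write(format_topic(pairs))
        output.write(u"\n\n\n")


# Invented sample data standing in for model.show_topic(ti, 64) results.
topics = [
    [(u'school', 0.75), (u'prom', 0.25)],   # (word, weight) order
    [(0.5, u'court'), (0.5, u'judge')],     # (weight, word) order
]
buf = io.StringIO()
write_topics(topics, buf)
print(buf.getvalue())
```

With a real model, the stream would be a file opened once before the loop, for example with open('topics.txt', 'w') as output, and each element of topics would come from model.show_topic(ti, 64).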
