I've been trying to run the blei_lda.py file from Chapter 4 of Building Machine Learning Systems with Python, but with no success. I'm using Python 2.7 and the Enthought Canopy GUI. Below is the actual file provided by the authors; there are also several copies of it on GitHub.
The problem is that I keep getting the following error:
TypeError                                 Traceback (most recent call last)
c:\users\matt\desktop\pythonprojects\pml\ch04\blei_lda.py in <module>()
        for ti in range(model.num_topics):
            words = model.show_topic(ti, 64)
------>     tf = sum(f for f, w in words)
            with open('topics.txt', 'w') as output:
                output.write('\n'.join('{}:{}'.format(w, int(1000. * f / tf)) for f, w in words))
                output.write("\n\n\n")

TypeError: unsupported operand type(s) for +: 'int' and 'unicode'
I've tried to put together a fix myself, but haven't found anything that fully works.
I've also searched online and on Stack Overflow for a solution, but I seem to be the only one having trouble running this file.
# This code is supporting material for the book
# Building Machine Learning Systems with Python
# by Willi Richert and Luis Pedro Coelho
# published by PACKT Publishing
#
# It is made available under the MIT License

from __future__ import print_function
from wordcloud import create_cloud
try:
    from gensim import corpora, models, matutils
except:
    print("import gensim failed.")
    print()
    print("Please install it")
    raise

import matplotlib.pyplot as plt
import numpy as np
from os import path

NUM_TOPICS = 100

# Check that data exists
if not path.exists('./data/ap/ap.dat'):
    print('Error: Expected data to be present at data/ap/')
    print('Please cd into ./data & run ./download_ap.sh')

# Load the data
corpus = corpora.BleiCorpus('./data/ap/ap.dat', './data/ap/vocab.txt')

# Build the topic model
model = models.ldamodel.LdaModel(
    corpus, num_topics=NUM_TOPICS, id2word=corpus.id2word, alpha=None)

# Iterate over all the topics in the model
for ti in range(model.num_topics):
    words = model.show_topic(ti, 64)
    tf = sum(f for f, w in words)
    with open('topics.txt', 'w') as output:
        output.write('\n'.join('{}:{}'.format(w, int(1000. * f / tf)) for f, w in words))
        output.write("\n\n\n")

# We first identify the most discussed topic, i.e., the one with the
# highest total weight
topics = matutils.corpus2dense(model[corpus], num_terms=model.num_topics)
weight = topics.sum(1)
max_topic = weight.argmax()

# Get the top 64 words for this topic
# Without the argument, show_topic would return only 10 words
words = model.show_topic(max_topic, 64)

# This function will actually check for the presence of pytagcloud and is otherwise a no-op
create_cloud('cloud_blei_lda.png', words)

num_topics_used = [len(model[doc]) for doc in corpus]

fig, ax = plt.subplots()
ax.hist(num_topics_used, np.arange(42))
ax.set_ylabel('Nr of documents')
ax.set_xlabel('Nr of topics')
fig.tight_layout()
fig.savefig('Figure_04_01.png')

# Now, repeat the same exercise using alpha=1.0
# You can edit the constant below to play around with this parameter
ALPHA = 1.0

model1 = models.ldamodel.LdaModel(
    corpus, num_topics=NUM_TOPICS, id2word=corpus.id2word, alpha=ALPHA)
num_topics_used1 = [len(model1[doc]) for doc in corpus]

fig, ax = plt.subplots()
ax.hist([num_topics_used, num_topics_used1], np.arange(42))
ax.set_ylabel('Nr of documents')
ax.set_xlabel('Nr of topics')

# The coordinates below were fit by trial and error to look good
ax.text(9, 223, r'default alpha')
ax.text(26, 156, 'alpha=1.0')
fig.tight_layout()
fig.savefig('Figure_04_02.png')
Answer:
On this line:

words = model.show_topic(ti, 64)

words is a list of (unicode, float64) tuples,
for example: [(u'school', 0.029515796999228502), (u'prom', 0.018586355008452897)]
So on the line

tf = sum(f for f, w in words)

f is bound to the unicode word and w to the float value. You end up trying to sum the unicode values, which is what raises the unsupported operand type error.
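Here is a minimal sketch that reproduces the problem using the two pairs shown above (the words and weights are just the illustrative values, not real model output):

# words as returned by show_topic here: (unicode word, float weight) pairs
words = [(u'school', 0.029515796999228502), (u'prom', 0.018586355008452897)]

# Unpacking as (f, w) binds f to the unicode word, so sum() ends up computing 0 + u'school'
tf = sum(f for f, w in words)
# TypeError: unsupported operand type(s) for +: 'int' and 'unicode'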
Change the line to

tf = sum(f for w, f in words)

so that it sums the float values instead.
For the same reason, also change this line:

output.write('\n'.join('{}:{}'.format(w, int(1000. * f / tf)) for w, f in words))
So the code snippet will look like this:
for ti in range(model.num_topics):
    words = model.show_topic(ti, 64)
    tf = sum(f for w, f in words)
    with open('topics.txt', 'w') as output:
        output.write('\n'.join('{}:{}'.format(w, int(1000. * f / tf)) for w, f in words))
        output.write("\n\n\n")
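If you want to double-check which order your installed gensim returns before editing the loop (the (word, weight) vs. (weight, word) order has differed across gensim releases, which is presumably why the book's snippet no longer runs as-is), a quick inspection like this, run after the model is built, prints a few pairs:

# Print the first topic's top 3 pairs to see whether the word or the weight comes first
print(model.show_topic(0, 3))
# e.g. [(u'school', 0.0295...), (u'prom', 0.0185...), ...] means (word, weight)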