I am working through the book Programming Collective Intelligence. Here is my code:
import feedparser
import re

# Returns the title and a dictionary of word counts for an RSS feed
def getwordcounts(url):
    # Parse the feed
    d = feedparser.parse(url)
    wc = {}

    # Loop over all the entries
    for e in d.entries:
        if 'summary' in e:
            summary = e.summary
        else:
            summary = e.description

        # Extract a list of words
        words = getwords(e.title + ' ' + summary)
        for word in words:
            wc.setdefault(word, 0)
            wc[word] += 1
    return d.feed.title, wc

def getwords(html):
    # Remove all the HTML tags
    txt = re.compile(r'<[^>]+>').sub('', html)

    # Split words by all non-alpha characters
    words = re.compile(r'[^A-Z^a-z]+').split(txt)

    # Convert to lowercase
    return [word.lower() for word in words if word != '']

apcount = {}
wordcounts = {}
feedlist = [line for line in file('feedlist.txt')]
for feedurl in feedlist:
    title, wc = getwordcounts(feedurl)
    wordcounts[title] = wc
    for word, count in wc.items():
        apcount.setdefault(word, 0)
        if count > 1:
            apcount[word] += 1

wordlist = []
for w, bc in apcount.items():
    frac = float(bc) / len(feedlist)
    if frac > 0.1 and frac < 0.5:
        wordlist.append(w)

out = file('blogdata.txt', 'w')
out.write('Blog')
for word in wordlist:
    out.write('\t%s' % word)
out.write('\n')
for blog, wc in wordcounts.items():
    out.write(blog)
    for word in wordlist:
        if word in wc:
            out.write('\t%d' % wc[word])
        else:
            out.write('\t0')
    out.write('\n')
When I run this script, I get the following message:
Traceback (most recent call last):
  File "generatefeedvector.py", line 38, in <module>
    title, wc = getwordcounts(feedurl)
  File "generatefeedvector.py", line 22, in getwordcounts
    return d.feed.title, wc
  File "build/bdist.linux-x86_64/egg/feedparser.py", line 416, in __getattr__
AttributeError: object has no attribute 'title'
I have checked that my feedparser version is 5.1.3.
How can I solve this problem? Thanks.
Answer:
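One of the URLs you are passing to feedparser is either not a valid feed at all but an ordinary web page (you can check it with feedvalidator), or the feed is empty, or its title is empty. In each of these cases d.feed has no title attribute, so d.feed.title raises the AttributeError.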
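To find out which of these cases applies to a given URL, the parsed result can be inspected before the title is used. The sketch below is only an illustration: `diagnose_feed` is a hypothetical helper name, and it assumes that its argument behaves like the dict-like object returned by `feedparser.parse()`, with the `bozo` flag (set by feedparser when the document is not well-formed XML), `entries`, and `feed` keys.

```python
def diagnose_feed(parsed):
    """Return a short reason why a parsed feed is unusable, or None if it looks fine.

    `parsed` is assumed to behave like the result of feedparser.parse(),
    i.e. a dict-like object with 'bozo', 'entries' and 'feed' keys.
    """
    feed = parsed.get('feed', {})
    entries = parsed.get('entries', [])

    # Ill-formed document with nothing recoverable: likely not a feed at all
    if parsed.get('bozo') and not entries:
        return 'not a well-formed feed (possibly an HTML page)'
    if not entries:
        return 'feed has no entries'
    if not feed.get('title'):
        return 'feed has no title'
    return None
```

Inside the loop over `feedlist.txt` one could then call `diagnose_feed(feedparser.parse(feedurl))` and print the URL together with the reason whenever the result is not `None`.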
As a workaround, use getattr() with a default value:
return getattr(d.feed, 'title', 'Unknown title'), wc
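One thing to watch with a fixed default: `wordcounts` is keyed by title, so two untitled feeds would both be stored under 'Unknown title' and the second would overwrite the first. A possible variation, sketched here under assumptions, falls back to the feed URL instead. `feed_title` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins for real `feedparser.parse()` results so that the snippet runs without network access (this requires Python 3, unlike the script in the question).

```python
from types import SimpleNamespace  # only needed for the stand-in objects below


def feed_title(parsed, url):
    """Return the feed's title, or the (stripped) URL if the title is missing or empty."""
    title = getattr(parsed.feed, 'title', None)
    return title or url.strip()


# Stand-ins for feedparser.parse() results (assumption: attribute access like d.feed.title)
with_title = SimpleNamespace(feed=SimpleNamespace(title='Example Blog'))
without_title = SimpleNamespace(feed=SimpleNamespace())

print(feed_title(with_title, 'http://example.com/a.xml\n'))
print(feed_title(without_title, 'http://example.com/b.xml\n'))
```

In `getwordcounts` this would correspond to `return feed_title(d, url), wc`, which keeps every feed under a distinct key.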