Yesterday I was trying to finish Lesson 11 of the Udacity course, on text vectorization. I went through the code and everything seemed to run fine: I processed a batch of emails, opened them, removed a few signature words, and returned each email's stemmed words into a list.
This is loop 1:
for name, from_person in [("sara", from_sara), ("chris", from_chris)]:
    for path in from_person:
        ### only look at first 200 emails when developing
        ### once everything is working, remove this line to run over full dataset
        # temp_counter += 1
        if temp_counter < 200:
            path = os.path.join('/xxx', path[:-1])
            email = open(path, "r")

            ### use parseOutText to extract the text from the opened email
            email_stemmed = parseOutText(email)

            ### use str.replace() to remove any instances of the words
            ### ["sara", "shackleton", "chris", "germani"]
            email_stemmed.replace("sara","")
            email_stemmed.replace("shackleton","")
            email_stemmed.replace("chris","")
            email_stemmed.replace("germani","")

            ### append the text to word_data
            word_data.append(email_stemmed.replace('\n', ' ').strip())

            ### append a 0 to from_data if email is from Sara, and 1 if email is from Chris
            if from_person == "sara":
                from_data.append(0)
            elif from_person == "chris":
                from_data.append(1)

            email.close()
This is loop 2:
for name, from_person in [("sara", from_sara), ("chris", from_chris)]:
    for path in from_person:
        ### only look at first 200 emails when developing
        ### once everything is working, remove this line to run over full dataset
        # temp_counter += 1
        if temp_counter < 200:
            path = os.path.join('/xxx', path[:-1])
            email = open(path, "r")

            ### use parseOutText to extract the text from the opened email
            stemmed_email = parseOutText(email)

            ### use str.replace() to remove any instances of the words
            ### ["sara", "shackleton", "chris", "germani"]
            signature_words = ["sara", "shackleton", "chris", "germani"]
            for each_word in signature_words:
                stemmed_email = stemmed_email.replace(each_word, '')  # careful here, dont use another variable, I did and broke my head to solve it

            ### append the text to word_data
            word_data.append(stemmed_email)

            ### append a 0 to from_data if email is from Sara, and 1 if email is from Chris
            if name == "sara":
                from_data.append(0)
            else:  # its chris
                from_data.append(1)

            email.close()
The next part of the code works as expected:
print("emails processed")from_sara.close()from_chris.close()pickle.dump( word_data, open("/xxx/your_word_data.pkl", "wb") )pickle.dump( from_data, open("xxx/your_email_authors.pkl", "wb") )print("Answer to Lesson 11 quiz 19: ")print(word_data[152])### in Part 4, do TfIdf vectorization herefrom sklearn.feature_extraction.text import TfidfVectorizerfrom sklearn.feature_extraction import stop_wordsprint("SKLearn has this many Stop Words: ")print(len(stop_words.ENGLISH_STOP_WORDS))vectorizer = TfidfVectorizer(stop_words="english", lowercase=True)vectorizer.fit_transform(word_data)feature_names = vectorizer.get_feature_names()print('Number of different words: ')print(len(feature_names))
But when I count the total number of words with loop 1, I get the wrong result; with loop 2, I get the correct one.
I have stared at this code for so long that I can't spot the difference. What am I doing wrong in loop 1?
For the record, the wrong answer I keep getting is 38825. The correct answer should be 38757.
Many thanks for your help, kind stranger!
Answer:
These lines have no effect:
email_stemmed.replace("sara","")email_stemmed.replace("shackleton","")email_stemmed.replace("chris","")email_stemmed.replace("germani","")
replace returns a new string; it does not modify email_stemmed in place. Instead, assign the return value back to email_stemmed:
email_stemmed = email_stemmed.replace("sara", "")
And so on.
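A quick illustration in the Python shell, using a made-up string rather than the real email data:

>>> text = "hi sara"
>>> text.replace("sara", "")          # returns a new string...
'hi '
>>> text                              # ...while the original is untouched
'hi sara'
>>> text = text.replace("sara", "")   # assign the result back to keep it
>>> text
'hi '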
Loop 2, by contrast, does assign the return value inside the for loop:
for each_word in signature_words:
    stemmed_email = stemmed_email.replace(each_word, '')
The two snippets are not equivalent: at the end of the first one, email_stemmed has not changed at all because replace was used incorrectly, while at the end of the second one, stemmed_email really has had each of the words removed.
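To make the difference concrete, here is a minimal, self-contained sketch using a toy string (not the real email data) that mirrors the two patterns:

# Toy input standing in for one stemmed email; not the real data.
stemmed = "sara shackleton wrote to chris germani about stock"

# Pattern from loop 1: every replace() result is thrown away,
# so the string comes out exactly as it went in.
broken = stemmed
broken.replace("sara", "")
broken.replace("shackleton", "")
broken.replace("chris", "")
broken.replace("germani", "")
print(broken)   # -> "sara shackleton wrote to chris germani about stock"

# Pattern from loop 2: the result is assigned back on every pass,
# so the signature words really are removed.
fixed = stemmed
for each_word in ["sara", "shackleton", "chris", "germani"]:
    fixed = fixed.replace(each_word, "")
print(fixed)    # -> "  wrote to   about stock" (only the surrounding spaces remain)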