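Two Python loops that look similar but produce different output?

Yesterday I was working through Udacity Lesson 11, on text vectorization. I went through the code and everything seemed to run fine: I take some emails, open them, remove some signature words, and return each email's stemmed words to a list.

Here is loop 1: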

for name, from_person in [("sara", from_sara), ("chris", from_chris)]:
    for path in from_person:
        ### only look at first 200 emails when developing
        ### once everything is working, remove this line to run over full dataset
#        temp_counter += 1
        if temp_counter < 200:
            path = os.path.join('/xxx', path[:-1])
            email = open(path, "r")
            ### use parseOutText to extract the text from the opened email
            email_stemmed = parseOutText(email)
            ### use str.replace() to remove any instances of the words
            ### ["sara", "shackleton", "chris", "germani"]
            email_stemmed.replace("sara","")
            email_stemmed.replace("shackleton","")
            email_stemmed.replace("chris","")
            email_stemmed.replace("germani","")
            ### append the text to word_data
            word_data.append(email_stemmed.replace('\n', ' ').strip())
            ### append a 0 to from_data if email is from Sara, and 1 if email is from Chris
            if from_person == "sara":
                from_data.append(0)
            elif from_person == "chris":
                from_data.append(1)
            email.close()

Here is loop 2:

for name, from_person in [("sara", from_sara), ("chris", from_chris)]:
    for path in from_person:
        ### only look at first 200 emails when developing
        ### once everything is working, remove this line to run over full dataset
#        temp_counter += 1
        if temp_counter < 200:
            path = os.path.join('/xxx', path[:-1])
            email = open(path, "r")
            ### use parseOutText to extract the text from the opened email
            stemmed_email = parseOutText(email)
            ### use str.replace() to remove any instances of the words
            ### ["sara", "shackleton", "chris", "germani"]
            signature_words = ["sara", "shackleton", "chris", "germani"]
            for each_word in signature_words:
                stemmed_email = stemmed_email.replace(each_word, '')         #careful here, dont use another variable, I did and broke my head to solve it
            ### append the text to word_data
            word_data.append(stemmed_email)
            ### append a 0 to from_data if email is from Sara, and 1 if email is from Chris
            if name == "sara":
                from_data.append(0)
            else: # its chris
                from_data.append(1)
            email.close()

The next part of the code works as expected:

print("emails processed")
from_sara.close()
from_chris.close()
pickle.dump( word_data, open("/xxx/your_word_data.pkl", "wb") )
pickle.dump( from_data, open("xxx/your_email_authors.pkl", "wb") )
print("Answer to Lesson 11 quiz 19: ")
print(word_data[152])
### in Part 4, do TfIdf vectorization here
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction import stop_words
print("SKLearn has this many Stop Words: ")
print(len(stop_words.ENGLISH_STOP_WORDS))
vectorizer = TfidfVectorizer(stop_words="english", lowercase=True)
vectorizer.fit_transform(word_data)
feature_names = vectorizer.get_feature_names()
print('Number of different words: ')
print(len(feature_names))

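But when I count the total number of words using loop 1, I get the wrong result. Using loop 2, I get the correct one.

I have been staring at this code for too long and cannot spot the difference. What am I doing wrong in loop 1?

For the record, the wrong answer I keep getting is 38825. The correct answer should be 38757.

Many thanks for your help, kind stranger!


Answer:

These lines do nothing: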

email_stemmed.replace("sara","")
email_stemmed.replace("shackleton","")
email_stemmed.replace("chris","")
email_stemmed.replace("germani","")

replace returns a new string and does not modify email_stemmed. Instead, you should assign the return value back to email_stemmed:

email_stemmed = email_stemmed.replace("sara", "")

And so on.
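As a minimal sketch of why the original calls have no effect: Python strings are immutable, so str.replace() can only hand back a new string and never changes the one it is called on. The sample text below is made up for illustration and is not taken from the email corpus.

```python
# Hypothetical sample text, not taken from the actual email corpus.
email_stemmed = "hi sara shackleton pleas review the contract"

# Calling replace() and discarding the result leaves the variable untouched.
email_stemmed.replace("sara", "")
unchanged = email_stemmed

# Assigning the result back is what actually updates the variable.
email_stemmed = email_stemmed.replace("sara", "")
changed = email_stemmed

print(repr(unchanged))
print(repr(changed))
```

The first printed value still contains "sara"; the second does not.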

Loop 2 does assign the return value, inside its for loop:

for each_word in signature_words:
    stemmed_email = stemmed_email.replace(each_word, '')

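The two snippets are not equivalent: at the end of the first, email_stemmed is completely unchanged because replace was used incorrectly, whereas at the end of the second, stemmed_email has actually had each of the words removed.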
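The effect on the word count can be sketched on a tiny made-up corpus. The helper names clean_like_loop_1, clean_like_loop_2 and vocabulary_size are hypothetical, and the whitespace-split set below is only a crude stand-in for the vocabulary that TfidfVectorizer builds, so the numbers are illustrative rather than a reproduction of 38825 and 38757.

```python
SIGNATURE_WORDS = ["sara", "shackleton", "chris", "germani"]


def clean_like_loop_1(text):
    """Mimic loop 1: the result of replace() is discarded, so text is unchanged."""
    for word in SIGNATURE_WORDS:
        text.replace(word, "")  # return value thrown away
    return text


def clean_like_loop_2(text):
    """Mimic loop 2: the result of replace() is assigned back each time."""
    for word in SIGNATURE_WORDS:
        text = text.replace(word, "")
    return text


def vocabulary_size(documents):
    """Count distinct whitespace-separated tokens across all documents."""
    return len({token for doc in documents for token in doc.split()})


# Hypothetical stemmed emails, invented for this example.
emails = [
    "pleas review the contract sara shackleton",
    "the trade is confirm chris germani",
]

size_loop_1 = vocabulary_size([clean_like_loop_1(e) for e in emails])
size_loop_2 = vocabulary_size([clean_like_loop_2(e) for e in emails])
print(size_loop_1, size_loop_2)  # prints: 11 7
```

The loop 1 variant keeps the four signature words in the vocabulary, which is consistent with the larger count reported in the question. Note that str.replace() removes matching substrings, not only whole words.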
