I'm trying to run word2vec with python3, but since my dataset is too large to fit easily in memory, I'm loading it via an iterator (from zip files). However, when I run it I get the following error:
    Traceback (most recent call last):
      File "WordModel.py", line 85, in <module>
        main()
      File "WordModel.py", line 15, in main
        word2vec = gensim.models.Word2Vec(data,workers=cpu_count())
      File "/home/thijser/.local/lib/python3.7/site-packages/gensim/models/word2vec.py", line 783, in __init__
        fast_version=FAST_VERSION)
      File "/home/thijser/.local/lib/python3.7/site-packages/gensim/models/base_any2vec.py", line 759, in __init__
        self.build_vocab(sentences=sentences, corpus_file=corpus_file, trim_rule=trim_rule)
      File "/home/thijser/.local/lib/python3.7/site-packages/gensim/models/base_any2vec.py", line 936, in build_vocab
        sentences=sentences, corpus_file=corpus_file, progress_per=progress_per, trim_rule=trim_rule)
      File "/home/thijser/.local/lib/python3.7/site-packages/gensim/models/word2vec.py", line 1591, in scan_vocab
        total_words, corpus_count = self._scan_vocab(sentences, progress_per, trim_rule)
      File "/home/thijser/.local/lib/python3.7/site-packages/gensim/models/word2vec.py", line 1576, in _scan_vocab
        total_words += len(sentence)
    TypeError: object of type 'generator' has no len()
Here is the code:
    import zipfile
    import os
    from ast import literal_eval
    from lxml import etree
    import io
    import gensim
    from multiprocessing import cpu_count


    def main():
        data = TrainingData("/media/thijser/Data/DataSets/uit2")
        print(len(data))
        word2vec = gensim.models.Word2Vec(data, workers=cpu_count())
        word2vec.save('word2vec.save')


    class TrainingData:
        size = -1

        def __init__(self, dirname):
            self.data_location = dirname

        def __len__(self):
            if self.size < 0:
                for zipfile in self.get_zips_in_folder(self.data_location):
                    for text_file in self.get_files_names_from_zip(zipfile):
                        self.size = self.size + 1
            return self.size

        def __iter__(self):  # might not fit in memory otherwise
            yield self.get_data()

        def get_data(self):
            for zipfile in self.get_zips_in_folder(self.data_location):
                for text_file in self.get_files_names_from_zip(zipfile):
                    yield self.preproccess_text(text_file)

        def stripXMLtags(self, text):
            tree = etree.parse(text)
            notags = etree.tostring(tree, encoding='utf8', method='text')
            return notags.decode("utf-8")

        def remove_newline(self, text):
            text.replace("\n", " ")
            return text

        def preproccess_text(self, text):
            text = self.stripXMLtags(text)
            text = self.remove_newline(text)
            return text

        def get_files_names_from_zip(self, zip_location):
            files = []
            archive = zipfile.ZipFile(zip_location, 'r')
            for info in archive.infolist():
                files.append(archive.open(info.filename))
            return files

        def get_zips_in_folder(self, location):
            zip_files = []
            for root, dirs, files in os.walk(location):
                for name in files:
                    if name.endswith(".zip"):
                        filepath = root + "/" + name
                        zip_files.append(filepath)
            return zip_files


    main()

Running

    for d in data:
        for dd in d:
            print(type(dd))
does show that dd is of type string, and contains the correctly preprocessed strings (each between 50 and 5000 words long).
Answer:
Update after discussion:
The __iter__() function of your TrainingData class doesn't provide a generator that returns texts one at a time, but rather a generator that returns another generator (there's one extra level of yield). That's not what Word2Vec expects.
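To see the difference, here is a minimal standalone sketch (with illustrative names, not your actual classes or gensim code) contrasting an __iter__() that yields a generator with one that returns it:

    def texts():
        # stand-in for get_data(): a generator over preprocessed texts
        yield from ["first text", "second text"]

    class YieldsGenerator:
        def __iter__(self):
            yield texts()  # the corpus appears to hold ONE item: a generator object

    class ReturnsGenerator:
        def __iter__(self):
            return texts()  # the corpus appears to hold the texts themselves

    print(list(YieldsGenerator()))   # [<generator object texts at 0x...>]
    print(list(ReturnsGenerator()))  # ['first text', 'second text']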
Changing the body of your __iter__() method to simply...

    return self.get_data()

...so that __iter__() becomes a synonym for your get_data(), and just returns the same generator of texts that get_data() provides, should help.
Original answer:
You haven't shown the TrainingData.preproccess_text() (sic) method that's referenced inside get_data(), and it's what actually creates the data that Word2Vec is processing. And it's that data that's generating the error.
Word2Vec requires its sentences corpus to be an iterable sequence (for which a generator would be appropriate) where each individual item is a list of string tokens.
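For illustration, a minimal sketch of that expected shape, using made-up toy sentences (min_count=1 is only needed here so the tiny corpus isn't pruned away):

    import gensim

    sentences = [
        ["the", "quick", "brown", "fox"],
        ["jumped", "over", "the", "lazy", "dog"],
    ]
    model = gensim.models.Word2Vec(sentences, min_count=1, workers=2)
    print(model.wv["fox"])  # the learned vector for the token "fox"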
From that error, it looks like the individual items in your TrainingData sequence may themselves be generators, rather than lists with a readable len().
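The failing call is easy to reproduce in plain Python, since lists report a length but generators don't:

    tokens = ["hello", "world"]
    print(len(tokens))       # 2 -- a list of tokens has a readable len()

    gen = (t for t in tokens)
    print(len(gen))          # TypeError: object of type 'generator' has no len()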
(Separately: if your choice of generators there is because the individual texts might be very long, be aware that gensim's Word2Vec and related classes only train on individual texts with a length of up to 10000 word tokens. Any words past the 10000th will be silently ignored. If that's a concern, your source texts should be broken up front into individual texts of no more than 10000 tokens each.)
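If that pre-splitting is needed, a hypothetical helper (chunk_tokens is my name for it, not a gensim function) could look like this:

    def chunk_tokens(tokens, max_len=10000):
        # Split one over-long list of tokens into pieces of at most
        # max_len tokens, so Word2Vec's per-text cap ignores nothing.
        for start in range(0, len(tokens), max_len):
            yield tokens[start:start + max_len]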