I'm trying to run word2vec with python3, but since my dataset is too large to fit easily in memory, I'm loading it via an iterator (from zip files). However, when I run it I get the following error:
    Traceback (most recent call last):
      File "WordModel.py", line 85, in <module>
        main()
      File "WordModel.py", line 15, in main
        word2vec = gensim.models.Word2Vec(data,workers=cpu_count())
      File "/home/thijser/.local/lib/python3.7/site-packages/gensim/models/word2vec.py", line 783, in __init__
        fast_version=FAST_VERSION)
      File "/home/thijser/.local/lib/python3.7/site-packages/gensim/models/base_any2vec.py", line 759, in __init__
        self.build_vocab(sentences=sentences, corpus_file=corpus_file, trim_rule=trim_rule)
      File "/home/thijser/.local/lib/python3.7/site-packages/gensim/models/base_any2vec.py", line 936, in build_vocab
        sentences=sentences, corpus_file=corpus_file, progress_per=progress_per, trim_rule=trim_rule)
      File "/home/thijser/.local/lib/python3.7/site-packages/gensim/models/word2vec.py", line 1591, in scan_vocab
        total_words, corpus_count = self._scan_vocab(sentences, progress_per, trim_rule)
      File "/home/thijser/.local/lib/python3.7/site-packages/gensim/models/word2vec.py", line 1576, in _scan_vocab
        total_words += len(sentence)
    TypeError: object of type 'generator' has no len()
Here is the code:
    import zipfile
    import os
    from ast import literal_eval
    from lxml import etree
    import io
    import gensim
    from multiprocessing import cpu_count


    def main():
        data = TrainingData("/media/thijser/Data/DataSets/uit2")
        print(len(data))
        word2vec = gensim.models.Word2Vec(data, workers=cpu_count())
        word2vec.save('word2vec.save')


    class TrainingData:
        size = -1

        def __init__(self, dirname):
            self.data_location = dirname

        def __len__(self):
            if self.size < 0:
                for zipfile in self.get_zips_in_folder(self.data_location):
                    for text_file in self.get_files_names_from_zip(zipfile):
                        self.size = self.size + 1
            return self.size

        def __iter__(self):  # might not fit in memory otherwise
            yield self.get_data()

        def get_data(self):
            for zipfile in self.get_zips_in_folder(self.data_location):
                for text_file in self.get_files_names_from_zip(zipfile):
                    yield self.preproccess_text(text_file)

        def stripXMLtags(self, text):
            tree = etree.parse(text)
            notags = etree.tostring(tree, encoding='utf8', method='text')
            return notags.decode("utf-8")

        def remove_newline(self, text):
            text.replace("\n", " ")
            return text

        def preproccess_text(self, text):
            text = self.stripXMLtags(text)
            text = self.remove_newline(text)
            return text

        def get_files_names_from_zip(self, zip_location):
            files = []
            archive = zipfile.ZipFile(zip_location, 'r')
            for info in archive.infolist():
                files.append(archive.open(info.filename))
            return files

        def get_zips_in_folder(self, location):
            zip_files = []
            for root, dirs, files in os.walk(location):
                for name in files:
                    if name.endswith(".zip"):
                        filepath = root + "/" + name
                        zip_files.append(filepath)
            return zip_files


    main()

Running

    for d in data:
        for dd in d:
            print(type(dd))
does show that dd is of type string, and contains the correctly preprocessed strings (each between 50 and 5000 words long).
Answer:
Update after discussion:
The __iter__() function of your TrainingData class doesn't provide a generator that returns texts one at a time, but rather a generator that returns another generator (there's one extra level of yield). That's not what Word2Vec expects.
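To see the difference, here is a minimal standalone sketch (with illustrative names, not your actual classes or gensim code) contrasting an __iter__() that yields a generator with one that returns it:

    def texts():
        # stand-in for get_data(): a generator over preprocessed texts
        yield from ["first text", "second text"]

    class YieldsGenerator:
        def __iter__(self):
            yield texts()  # the corpus appears to hold ONE item: a generator object

    class ReturnsGenerator:
        def __iter__(self):
            return texts()  # the corpus appears to hold the texts themselves

    print(list(YieldsGenerator()))   # [<generator object texts at 0x...>]
    print(list(ReturnsGenerator()))  # ['first text', 'second text']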
Changing the body of your __iter__() method to simply...

    return self.get_data()

...so that __iter__() becomes a synonym for your get_data(), and just returns the same generator of texts that get_data() provides, should help.
Original answer:
You haven't shown the TrainingData.preproccess_text() (sic) method that's referenced inside get_data(), and it's what actually creates the data that Word2Vec is processing. And it's that data that's generating the error.
Word2Vec requires its sentences corpus to be an iterable sequence (for which a generator would be appropriate) where each individual item is a list of string tokens.
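For illustration, a minimal sketch of that expected shape, using made-up toy sentences (min_count=1 is only needed here so the tiny corpus isn't pruned away):

    import gensim

    sentences = [
        ["the", "quick", "brown", "fox"],
        ["jumped", "over", "the", "lazy", "dog"],
    ]
    model = gensim.models.Word2Vec(sentences, min_count=1, workers=2)
    print(model.wv["fox"])  # the learned vector for the token "fox"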
From that error, it looks like the individual items in your TrainingData sequence may themselves be generators, rather than lists with a readable len().
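The failing call is easy to reproduce in plain Python, since lists report a length but generators don't:

    tokens = ["hello", "world"]
    print(len(tokens))       # 2 -- a list of tokens has a readable len()

    gen = (t for t in tokens)
    print(len(gen))          # TypeError: object of type 'generator' has no len()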
(Separately: if your choice of generators there is because the individual texts might be very long, be aware that gensim's Word2Vec and related classes only train on individual texts with a length of up to 10000 word tokens. Any words past the 10000th will be silently ignored. If that's a concern, your source texts should be broken up front into individual texts of no more than 10000 tokens each.)
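If that pre-splitting is needed, a hypothetical helper (chunk_tokens is my name for it, not a gensim function) could look like this:

    def chunk_tokens(tokens, max_len=10000):
        # Split one over-long list of tokens into pieces of at most
        # max_len tokens, so Word2Vec's per-text cap ignores nothing.
        for start in range(0, len(tokens), max_len):
            yield tokens[start:start + max_len]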