I am trying to create an unsupervised text classifier with Doc2Vec. I am getting the error message: "The following keywords from 'keywords_list' are unknown to the Doc2Vec model and therefore not used to train the model: sport". The rest of the error message is quite long, but it ends with "cannot compute similarity with no input", which makes me think none of my keywords were accepted.

The part of my code that creates the Lbl2Vec model is
Lbl2Vec_model = Lbl2Vec(keywords_list=list(labels["keywords"]), tagged_documents=df_pdfs['tagged_docs'], label_names=list(labels["class_name"]), similarity_threshold=0.43, min_num_docs=10, epochs=10)
I have been following this tutorial https://towardsdatascience.com/unsupervised-text-classification-with-lbl2vec-6c5e040354de, but with my own dataset, which I load from a JSON file.

The full code is shown below:
#imports
from ast import keyword
import tkinter as tk
from tkinter import filedialog
import json
import pandas as pd
import numpy as np
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import strip_tags
from gensim.models.doc2vec import TaggedDocument
from lbl2vec import Lbl2Vec

class DocumentVectorize:
    def __init__(self):
        pass

    #imports data from json file
    def import_from_json(self):
        root = tk.Tk()
        root.withdraw()
        json_file_path = filedialog.askopenfile().name
        with open(json_file_path, "r") as json_file:
            try:
                json_load = json.load(json_file)
            except:
                raise ValueError("No PDFs to convert to JSON")
        self.pdfs = json_load

if __name__ == "__main__":

    #tokenizes documents
    def tokenize(doc):
        return simple_preprocess(strip_tags(doc), deacc=True, min_len=2, max_len=1000000)

    #initializes document vectorization class and imports the data from json
    vectorizer = DocumentVectorize()
    vectorizer.import_from_json()

    #converts json data to dataframe
    df_pdfs = pd.DataFrame.from_dict(vectorizer.pdfs)

    #creates data frame that contains the keywords and their class names for the training
    labels = {"keywords": [["sport"], ["physics"]], "class_name": ["rec.sport", "rec.physics"]}
    labels = pd.DataFrame.from_dict(labels)

    #applies tokenization to documents
    df_pdfs['tagged_docs'] = df_pdfs.apply(lambda row: TaggedDocument(tokenize(row['text_clean']), [str(row.name)]), axis=1)

    #creates key for documents
    df_pdfs['doc_key'] = df_pdfs.index.astype(str)

    print(df_pdfs.head())

    #Initializes Lbl2vec model
    Lbl2Vec_model = Lbl2Vec(keywords_list=list(labels["keywords"]), tagged_documents=df_pdfs['tagged_docs'], label_names=list(labels["class_name"]), similarity_threshold=0.43, min_num_docs=10, epochs=10)

    #Fits Lbl2vec model to data
    Lbl2Vec_model.fit()
Answer:

The problem was that the Lbl2Vec model's default initialization sets the minimum word count to 50, meaning a word must occur at least 50 times in the corpus to be included in the word vectorization. No word in any of my documents reached that frequency, so every keyword was rejected, which produced the error message I was getting. This is what that line of code looks like after the update:
Lbl2Vec_model = Lbl2Vec(keywords_list=list(labels["keywords"]), tagged_documents=df_pdfs['tagged_docs'], label_names=list(labels["class_name"]), min_count=2, similarity_threshold=0.43, min_num_docs=10, epochs=10)
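If you want to see the pruning in action, here is a minimal sketch (the two tiny documents are made up purely for illustration) that trains gensim's Doc2Vec directly and shows how min_count decides whether a keyword like "sport" survives vocabulary pruning:

#minimal sketch: the toy corpus below is hypothetical, just to show how
#gensim's min_count drops rare words from the vocabulary
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [TaggedDocument("the sport match was great and the crowd cheered".split(), ["0"]),
        TaggedDocument("the physics lecture covered energy and mass".split(), ["1"])]

#with min_count=2, words that occur only once ("sport", "physics") are pruned
strict = Doc2Vec(docs, min_count=2, vector_size=50, epochs=5)
print("sport" in strict.wv.key_to_index)   # False

#with min_count=1 every word is kept, so the keyword is known to the model
lenient = Doc2Vec(docs, min_count=1, vector_size=50, epochs=5)
print("sport" in lenient.wv.key_to_index)  # True

As far as I can tell, Lbl2Vec forwards min_count to the Doc2Vec model it trains internally, so any keyword rarer than that threshold in your corpus ends up reported as "unknown to the Doc2Vec model".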