I'm having a problem with my tokenization function. Frankly, I feel lost, because I don't fully understand what goes on inside the transformers library. Here is what I'm trying to do:
I want to fine-tune a BLOOM model to build a conversational bot. Right now I don't really understand what happens during tokenization, so I also don't know how the data needs to be presented. All the examples I found online work with plain text; none of them cover training on a conversational dataset.
In the HuggingFace examples, they simply index `['text']` at the end of the tokenize function. Since my dataset has no `'text'` feature but a `['dialog']` one instead, I assumed swapping the key there would be enough. Apparently, it isn't.
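For reference, the tokenize function in the HuggingFace language-modeling examples looks roughly like this (quoted from memory, so details may differ):

```python
# Tokenize function from the HuggingFace examples; it assumes every
# example has a plain-text 'text' column, which my dataset doesn't have
def tokenize_function(examples):
    return tokenizer(examples["text"])
```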
I would be very grateful if someone could explain what exactly is going wrong in my code and how to fix it. I want to train various models over the next few months, and having the error explained will help me a lot going forward.
Here is my code, followed by the exact error and my notebook:
```python
import torch
import random
import numpy as np
from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments
import datasets

# Load the model and tokenizer
model_name = "bigscience/bloom-560m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Load the dataset
dataset = datasets.load_dataset('conv_ai_2')

# Tokenize the dataset
def tokenize_function(examples):
    return tokenizer(examples["dialog"])

tokenized_dataset = dataset.map(tokenize_function, batched=True, num_proc=4)

# Split into training and validation sets
train_dataset = tokenized_dataset['train']
val_dataset = tokenized_dataset['valid']

# Training arguments
training_args = TrainingArguments(
    output_dir='./results',
    evaluation_strategy="epoch",
    num_train_epochs=1,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    logging_steps=500,
    save_steps=500,
    seed=42,
    learning_rate=5e-5,
    report_to="none"
)

# Trainer object
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset
)

# Fine-tune the model
trainer.train()

# Generate a response
def generate_response(input_text, model, tokenizer):
    input_ids = tokenizer.encode(input_text, return_tensors='pt')
    chat_history_ids = model.generate(
        input_ids=input_ids,
        max_length=1000,
        do_sample=True,
        top_p=0.9,
        top_k=50
    )
    return tokenizer.decode(chat_history_ids[0], skip_special_tokens=True)

# Test the conversational bot
while True:
    user_input = input("You: ")
    response = generate_response(user_input, model, tokenizer)
    print("Bot: " + response)
```
Error:
```
---------------------------------------------------------------------------
RemoteTraceback                           Traceback (most recent call last)
RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/multiprocess/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/usr/local/lib/python3.9/dist-packages/datasets/utils/py_utils.py", line 1349, in _write_generator_to_queue
    for i, result in enumerate(func(**kwargs)):
  File "/usr/local/lib/python3.9/dist-packages/datasets/arrow_dataset.py", line 3329, in _map_single
    batch = apply_function_on_filtered_inputs(
  File "/usr/local/lib/python3.9/dist-packages/datasets/arrow_dataset.py", line 3210, in apply_function_on_filtered_inputs
    processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
  File "<ipython-input-18-25d239b4d59f>", line 17, in tokenize_function
    return tokenizer(examples["dialog"])
  File "/usr/local/lib/python3.9/dist-packages/datasets/formatting/formatting.py", line 280, in __getitem__
    value = self.data[key]
KeyError: 'dialog'
"""

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
<ipython-input-18-25d239b4d59f> in <module>
     17     return tokenizer(examples["dialog"])
     18
---> 19 tokenized_dataset = dataset.map(tokenize_function, batched=True, num_proc=4)
     20
     21 # Split into training and validation sets

13 frames
/usr/local/lib/python3.9/dist-packages/datasets/formatting/formatting.py in __getitem__()
    278
    279     def __getitem__(self, key):
--> 280         value = self.data[key]
    281         if key in self.keys_to_format:
    282             value = self.format(key)

KeyError: 'dialog'
```
Answer:
In your original `tokenize_function`, you tokenize the `"dialog"` key of `examples` directly. However, that does not produce input and label tensors of consistent dimensions, and this mismatch is what causes the error you hit during training. Instead, I first turn each dialog into a single string by joining the `"text"` value of every turn in that dialog. I then tokenize the dialog strings with truncation, padding, and an explicit maximum length, which yields tokenized input tensors of consistent shape. Next, I create the labels by shifting the `input_ids` one position to the left, so the model learns to predict the next token in the sequence; I clone the shifted tensor to avoid modifying `input_ids` in place, and append a pad token so that labels and inputs keep the same length.
```python
def tokenize_function(examples):
    # Join the turns of each dialog into a single string
    dialog_texts = [' '.join(entry["text"] for entry in dialog) for dialog in examples["dialog"]]
    # Tokenize with truncation and padding so every example has the same length
    tokenized = tokenizer(dialog_texts, truncation=True, padding='max_length',
                          max_length=128, return_tensors="pt")
    # Labels are the input_ids shifted one position to the left, so position i
    # learns to predict token i+1; clone to avoid modifying input_ids in place
    labels = tokenized["input_ids"][:, 1:].clone()
    # Append a pad token so labels and input_ids keep the same length
    pad_column = torch.full((labels.size(0), 1), tokenizer.pad_token_id, dtype=torch.long)
    tokenized["labels"] = torch.cat([labels, pad_column], dim=1)
    return tokenized
```
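One caveat worth adding (my own assumption about the conv_ai_2 schema, not part of the snippet above): after mapping, the tokenized dataset still carries the original columns such as `dialog` and the profile fields, which the Trainer cannot convert to tensors. Dropping them in the `map` call avoids that:

```python
# Remove the original string columns so that only input_ids, attention_mask
# and labels reach the Trainer; column names are taken from the train split
tokenized_dataset = dataset.map(
    tokenize_function,
    batched=True,
    num_proc=4,
    remove_columns=dataset["train"].column_names,
)
```

Since every example is already padded to `max_length`, the Trainer's default data collator can batch these columns without a custom collate function.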