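I want to update a model with new entities. I'm loading the "pt" NER model and trying to update it. Before doing anything else, I tried this sentence: "meu nome é Mário e hoje eu vou pra academia" (in English: "my name is Mário and today I'm going to the gym"). Before starting the whole process, I got this result: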
    Entities [('Mário', 'PER')]
    Tokens [('meu', '', 2), ('nome', '', 2), ('é', '', 2), ('Mário', 'PER', 3), ('e', '', 2), ('hoje', '', 2), ('eu', '', 2), ('vou', '', 2), ('pra', '', 2), ('academia', '', 2)]
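OK, Mário is a name, so that is correct. But I want the model to recognize "hoje" (today) as DATE, so I ran the script below.
After running the script, I tried the same sentence and got this result: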
    Entities [('hoje', 'DATE')]
    Tokens [('meu', '', 2), ('nome', '', 2), ('é', '', 2), ('Mário', '', 2), ('e', '', 2), ('hoje', 'DATE', 3), ('eu', '', 2), ('vou', '', 2), ('pra', '', 2), ('academia', '', 2)]
The model now recognizes "hoje" as DATE, but it has completely forgotten that Mário is a person (PER).
    from __future__ import unicode_literals, print_function

    import plac
    import random
    from pathlib import Path
    import spacy
    from spacy.util import minibatch, compounding

    # training data
    TRAIN_DATA = [
        ("Infelizmente não, eu briguei com meus amigos hoje", {"entities": [(45, 49, "DATE")]}),
        ("hoje foi um bom dia.", {"entities": [(0, 4, "DATE")]}),
        ("ah não sei, hoje foi horrível", {"entities": [(12, 16, "DATE")]}),
        ("hoje eu briguei com o Mário", {"entities": [(0, 4, "DATE")]}),
    ]


    @plac.annotations(
        model=("Model name. Defaults to the 'pt' model.", "option", "m", str),
        output_dir=("Optional output directory", "option", "o", Path),
        n_iter=("Number of training iterations", "option", "n", int),
    )
    def main(model="pt", output_dir="/model", n_iter=100):
        """Load the model, set up the pipeline and train the entity recognizer."""
        if model is not None:
            nlp = spacy.load(model)  # load existing spaCy model
            print("Loaded model '%s'" % model)
        else:
            nlp = spacy.blank("pt")  # create blank Language class
            print("Created blank 'pt' model")

        # entities and tokens before training
        doc = nlp("meu nome é Mário e hoje eu vou pra academia")
        print("Entities", [(ent.text, ent.label_) for ent in doc.ents])
        print("Tokens", [(t.text, t.ent_type_, t.ent_iob) for t in doc])

        # create the built-in pipeline components and add them to the pipeline
        # nlp.create_pipe works for built-ins that are registered with spaCy
        if "ner" not in nlp.pipe_names:
            ner = nlp.create_pipe("ner")
            nlp.add_pipe(ner, last=True)
        # otherwise, get it so we can add labels
        else:
            ner = nlp.get_pipe("ner")

        # add labels
        for _, annotations in TRAIN_DATA:
            for ent in annotations.get("entities"):
                ner.add_label(ent[2])

        # get names of other pipes to disable them during training
        other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "ner"]
        with nlp.disable_pipes(*other_pipes):  # only train NER
            # reset and initialize the weights randomly, but only if we're
            # training a new model
            if model is None:
                nlp.begin_training()
            for itn in range(n_iter):
                random.shuffle(TRAIN_DATA)
                losses = {}
                # batch up the examples using spaCy's minibatch
                batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001))
                for batch in batches:
                    texts, annotations = zip(*batch)
                    nlp.update(
                        texts,  # batch of texts
                        annotations,  # batch of annotations
                        drop=0.5,  # dropout - make it harder to memorise data
                        losses=losses,
                    )
                print("Losses", losses)

        # test the trained model
        # for text, _ in TRAIN_DATA:
        doc = nlp("meu nome é Mário e hoje eu vou pra academia")
        print("Entities", [(ent.text, ent.label_) for ent in doc.ents])
        print("Tokens", [(t.text, t.ent_type_, t.ent_iob) for t in doc])

        # save model to output directory
        if output_dir is not None:
            output_dir = Path(output_dir)
            if not output_dir.exists():
                output_dir.mkdir()
            nlp.to_disk(output_dir)
            print("Saved model to", output_dir)

            # test the saved model
            print("Loading from", output_dir)
            nlp2 = spacy.load(output_dir)
            # for text, _ in TRAIN_DATA:
            #     doc = nlp2(text)
            #     print("Entities", [(ent.text, ent.label_) for ent in doc.ents])
            #     print("Tokens", [(t.text, t.ent_type_, t.ent_iob) for t in doc])


    if __name__ == "__main__":
        plac.call(main)
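Answer:
In your training data, you need to annotate "Mário" as "PER". If you leave it out, the model learns from the new training data that "Mário" is not a "PER".
(Note: in the training data you should annotate every entity present in each sentence, not just the new ones.)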
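As a rough illustration of that note, here is a minimal sketch of the question's training data with the last sentence annotated for both entities. The helpers `find_span` and `make_example` are hypothetical names introduced here, not part of spaCy. They only compute character offsets with plain string search, and they assume each mention occurs once in its sentence.

```python
def find_span(text, mention, label):
    """Return (start, end, label) for the first occurrence of mention in text."""
    start = text.find(mention)
    if start == -1:
        raise ValueError("%r not found in %r" % (mention, text))
    return (start, start + len(mention), label)


def make_example(text, mentions):
    """Build one (text, {"entities": [...]}) tuple from (mention, label) pairs."""
    entities = [find_span(text, mention, label) for mention, label in mentions]
    return (text, {"entities": entities})


# Same sentences as in the question; every entity in a sentence is listed,
# not only the new DATE label.
TRAIN_DATA = [
    make_example("Infelizmente não, eu briguei com meus amigos hoje", [("hoje", "DATE")]),
    make_example("hoje foi um bom dia.", [("hoje", "DATE")]),
    make_example("ah não sei, hoje foi horrível", [("hoje", "DATE")]),
    # This sentence also mentions a person, so PER is annotated alongside DATE.
    make_example("hoje eu briguei com o Mário", [("hoje", "DATE"), ("Mário", "PER")]),
]

for text, annotations in TRAIN_DATA:
    print(text, annotations["entities"])
```

The resulting list has the same shape as `TRAIN_DATA` in the question, so it could be dropped into that script unchanged. Whether a single PER example is enough to preserve the old behaviour is not something this sketch can show. Adding more sentences that contain already-known entities would probably help.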