Python code for training an Arabic spaCy NER model gives no result or error

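This is the code I use to train a spaCy model for named entity recognition. My dataset is a JSON file of Arabic tweets. I manually labeled the locations in the dataset with the annotation tool at https://dataturks.com, but the code does not run.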

I used the code from this link: https://dataturks.com/help/dataturks-ner-json-to-spacy-train.php

    import json

    ############################################  NOTE  ########################################################
    #
    #           Creates NER training data in Spacy format from JSON downloaded from Dataturks.
    #
    #           Outputs the Spacy training data which can be used for Spacy training.
    #
    ############################################################################################################
    def convert_dataturks_to_spacy(dataturks_JSON_FilePath):
        training_data = []
        lines = []
        with open(dataturks_JSON_FilePath, 'r') as f:
            lines = f.readlines()

            # each line of the Dataturks export is one JSON record
            for line in lines:
                data = json.loads(line)
                text = data['content']
                entities = []
                annotations = data['annotation']
                if annotations:
                    for annotation in annotations:
                        # only a single point in text annotation.
                        point = annotation['points'][0]
                        labels = annotation['label']
                        # handle both list of labels or a single label.
                        if not isinstance(labels, list):
                            labels = [labels]
                        # print(labels)

                        for label in labels:
                            # dataturks indices are both inclusive [start, end] but spacy is not [start, end)
                            entities.append((point['start'], point['end'] + 1, label))

                    training_data.append((text, {"entities": entities}))

        return training_data

Training data

    TRAIN_DATA = convert_dataturks_to_spacy("/content/drive/My Drive/Colab Notebooks/Name Entity Recognition/NERTweets.json")
    TRAIN_DATA

Output for the first three tweets

    [('طقس حضرموت صور اوليه سيول وادي رخيه',
      {'entities': [(26, 35, 'loc'), (4, 10, 'city')]}),
     ('سيول وادي العف قرية هدى بمديرية حبان بمحافظة شبوة جنوب اليمن اليوم الاحد مايو م تصوير عدنان القميشي',
      {'entities': [(55, 60, 'country'),
        (50, 54, 'pre'),
        (45, 49, 'city'),
        (32, 36, 'loc'),
        (20, 23, 'loc'),
        (5, 14, 'loc')]}),
     ('اول مرة قابلته جدة جاها سيول', {'entities': [(15, 18, 'city')]})]

Then I train the spaCy NER model

    import spacy
    import random

    ################### Train Spacy NER.###########
    def train_spacy():
        TRAIN_DATA = convert_dataturks_to_spacy("/content/drive/My Drive/Colab Notebooks/Name Entity Recognition/NERTweets.json")
        nlp = spacy.blank('ar')  # create blank Language class

        # create the built-in pipeline components and add them to the pipeline
        # nlp.create_pipe works for built-ins that are registered with spaCy
        if 'ner' not in nlp.pipe_names:
            ner = nlp.create_pipe('ner')
            nlp.add_pipe(ner, last=True)

        # add labels
        for _, annotations in TRAIN_DATA:
            for ent in annotations.get('entities'):
                ner.add_label(ent[2])

        # get names of other pipes to disable them during training
        other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
        with nlp.disable_pipes(*other_pipes):  # only train NER
            optimizer = nlp.begin_training()
            for itn in range(1):
                print("Statring iteration " + str(itn))
                random.shuffle(TRAIN_DATA)
                losses = {}
                for text, annotations in TRAIN_DATA:
                    nlp.update(
                        [text],  # batch of texts
                        [annotations],  # batch of annotations
                        drop=0.2,  # dropout - make it harder to memorise data
                        sgd=optimizer,  # callable to update weights
                        losses=losses)
                print(losses)

        # do prediction
        doc = nlp("Samsing mobiles below $100")
        print("Entities= " + str(["" + str(ent.text) + "_" + str(ent.label_) for ent in doc.ents]))

    # the parentheses are required: a bare `train_spacy` only references the function
    train_spacy()

Error output

    Statring iteration 0
    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    <ipython-input-8-6b61c2d740cf> in <module>()
    ----> 1 train_spacy()

    2 frames
    /usr/local/lib/python3.6/dist-packages/spacy/language.py in _format_docs_and_golds(self, docs, golds)
        470                     err = Errors.E151.format(unexp=unexpected, exp=expected_keys)
        471                     raise ValueError(err)
    --> 472                 gold = GoldParse(doc, **gold)
        473             doc_objs.append(doc)
        474             gold_objs.append(gold)

    gold.pyx in spacy.gold.GoldParse.__init__()

    gold.pyx in spacy.gold.biluo_tags_from_offsets()

    ValueError: [E103] Trying to set conflicting doc.ents: '(42, 47, 'loc')' and '(34, 47, 'loc')'. A token can only be part of one entity, so make sure the entities you're setting don't overlap.

My results are uploaded to the Google Colab folder linked below. Where is the problem?

https://drive.google.com/drive/folders/19t33kW4Dwtbv6s4vfMpa2kNwoVNzSu5I


Answer:

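spaCy does not allow entities to overlap: a token can belong to only one entity. You should remove the overlapping entities, so your code should be: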

    import json

    def convert_dataturks_to_spacy(dataturks_JSON_FilePath):
        training_data = []
        with open(dataturks_JSON_FilePath, 'r') as f:
            lines = f.readlines()
            for line in lines:
                data = json.loads(line)
                text = data['content']
                entities = []
                annotations = []
                # `or []` skips records whose annotation field is empty or null
                for annotation in data['annotation'] or []:
                    point = annotation['points'][0]
                    label = annotation['label']
                    # dataturks indices are both inclusive [start, end] but spacy is not [start, end)
                    start, end = point['start'], point['end'] + 1
                    annotations.append((start, end, label, end - start))

                # longest spans first, so the shorter of two overlapping spans is the one dropped
                annotations = sorted(annotations, key=lambda ann: ann[3], reverse=True)
                seen_tokens = set()
                for start, end, labels, _ in annotations:
                    # keep a span only if neither of its end characters is already covered
                    if start not in seen_tokens and end - 1 not in seen_tokens:
                        seen_tokens.update(range(start, end))
                        if not isinstance(labels, list):
                            labels = [labels]
                        for label in labels:
                            entities.append((start, end, label))
                training_data.append((text, {"entities": entities}))
        return training_data
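To see which tweets would trigger E103 before training, the same overlap test can be run on data that is already in spaCy format. The following is only a minimal sketch and not part of the Dataturks script: the helper names `find_overlaps` and `drop_overlaps` are hypothetical, and it assumes each entity is a `(start, end, label)` tuple with an exclusive end, as in the `TRAIN_DATA` output shown in the question.

```python
from typing import List, Tuple

# Assumed span layout: (start, end_exclusive, label)
Span = Tuple[int, int, str]


def find_overlaps(entities: List[Span]) -> List[Tuple[Span, Span]]:
    """Return every pair of spans that share at least one character."""
    ordered = sorted(entities, key=lambda e: (e[0], e[1]))
    pairs = []
    for i, first in enumerate(ordered):
        for second in ordered[i + 1:]:
            if second[0] >= first[1]:
                break  # sorted by start, so no later span can overlap `first`
            pairs.append((first, second))
    return pairs


def drop_overlaps(entities: List[Span]) -> List[Span]:
    """Keep the longest spans and discard any span that collides with one already kept."""
    kept = []
    taken = set()  # character offsets already covered by a kept span
    for start, end, label in sorted(entities, key=lambda e: e[1] - e[0], reverse=True):
        if taken.isdisjoint(range(start, end)):
            kept.append((start, end, label))
            taken.update(range(start, end))
    return sorted(kept)


if __name__ == "__main__":
    # The two 'loc' spans are the ones named in the E103 message; the 'city' span is illustrative.
    sample = [(42, 47, 'loc'), (34, 47, 'loc'), (4, 10, 'city')]
    print(find_overlaps(sample))   # [((34, 47, 'loc'), (42, 47, 'loc'))]
    print(drop_overlaps(sample))   # [(4, 10, 'city'), (34, 47, 'loc')]
```

Keeping the longest span is a design choice made here; if the shorter annotation is the one you trust, sort ascending by length instead. Note that a span carrying several labels becomes several tuples with identical offsets, and this sketch keeps only the first of them.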
