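We have trained a model that recognizes custom named entities. The problem is that it does not perform as expected when given a whole document, but it gives excellent results when given only a few sentences.
I want to select the two sentences before and after a tagged entity.
For example, if part of a document contains the city Colombo (tagged as GPE), I need to select the two sentences before the tag and the two sentences after it. I have tried a few approaches, but the complexity is too high.
Is there a built-in way to do this in spaCy?
I am using Python and spaCy.
I tried parsing the document by looking up the index of the tag, but that approach is very slow.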
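Answer:
It is worth trying to improve the custom named entity recognizer, because extra context should not normally hurt performance, and fixing that problem would probably give better results overall.
However, regarding your specific question about the surrounding sentences: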
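A Token or a Span (an entity is a Span) has a .sent attribute that gives you the sentence containing it, as a Span. By looking at the token just before the start of a given sentence, or the token just after its end, you can get the previous or next sentence for any token in the document.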
import spacy


def get_previous_sentence(doc, token_index):
    # The token just before this sentence's first token belongs to the previous sentence.
    sent = doc[token_index].sent
    if sent.start - 1 < 0:
        return None
    return doc[sent.start - 1].sent


def get_next_sentence(doc, token_index):
    # sent.end is exclusive, so doc[sent.end] is already the first token of the next sentence.
    sent = doc[token_index].sent
    if sent.end >= len(doc):
        return None
    return doc[sent.end].sent


nlp = spacy.load('en_core_web_lg')
text = "Jane is a name. Here is a sentence. Here is another sentence. Jane was the mayor of Colombo in 2010. Here is another filler sentence. And here is yet another padding sentence without entities. Someone else is the mayor of Colombo right now."
doc = nlp(text)

for ent in doc.ents:
    print(ent, ent.label_, ent.sent)
    print("Prev:", get_previous_sentence(doc, ent.start))
    print("Next:", get_next_sentence(doc, ent.start))
    print("----")
Output:
Jane PERSON Jane is a name.
Prev: None
Next: Here is a sentence.
----
Jane PERSON Jane was the mayor of Colombo in 2010.
Prev: Here is another sentence.
Next: Here is another filler sentence.
----
Colombo GPE Jane was the mayor of Colombo in 2010.
Prev: Here is another sentence.
Next: Here is another filler sentence.
----
2010 DATE Jane was the mayor of Colombo in 2010.
Prev: Here is another sentence.
Next: Here is another filler sentence.
----
Colombo GPE Someone else is the mayor of Colombo right now.
Prev: And here is yet another padding sentence without entities.
Next: None
----
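
Since the question asks for two sentences on each side, the same idea can be repeated to widen the window. The following is only a minimal sketch under some assumptions: sentence_window is a hypothetical helper name (not part of spaCy), the code assumes spaCy v3, and it uses a blank English pipeline with the rule-based sentencizer plus a hand-made GPE span so that it runs without downloading a model. With a trained pipeline you would pass each entity from doc.ents instead.

```python
import spacy


def sentence_window(span, n_before=2, n_after=2):
    """Return a Span covering span's sentence plus up to n_before/n_after neighbours.

    Hypothetical helper for illustration; requires sentence boundaries on the Doc.
    """
    doc = span.doc
    start, end = span.sent.start, span.sent.end
    # Step backwards one sentence at a time via the token before the current start.
    for _ in range(n_before):
        if start == 0:
            break
        start = doc[start - 1].sent.start
    # Step forwards one sentence at a time; `end` is exclusive, so doc[end]
    # is the first token of the following sentence.
    for _ in range(n_after):
        if end >= len(doc):
            break
        end = doc[end].sent.end
    return doc[start:end]


# Assumption: a blank pipeline with the rule-based sentencizer stands in for
# the real model, and the entity span is created by hand.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
text = (
    "Jane is a name. Here is a sentence. Here is another sentence. "
    "Jane was the mayor of Colombo in 2010. Here is another filler sentence. "
    "And here is yet another padding sentence without entities. "
    "Someone else is the mayor of Colombo right now."
)
doc = nlp(text)
first = text.index("Colombo")
last = text.rindex("Colombo")
first_ent = doc.char_span(first, first + len("Colombo"), label="GPE")
last_ent = doc.char_span(last, last + len("Colombo"), label="GPE")

first_window = sentence_window(first_ent)
last_window = sentence_window(last_ent)
print(first_window.text)
print(last_window.text)
```

Near the start or end of the document the window simply contains fewer sentences rather than returning None. Because each step only follows .sent from a single token, there is no need to scan the whole document for every entity.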