Why is "machine_learning" lemmatized both as "machine_learning" and as "machine_learne"?

I am running LDA on a number of texts. When I generated some visualisations of the produced topics, I found that the bigram "machine_learning" had been lemmatized both as "machine_learning" and as "machine_learne". Here is as minimal a reproducible example as I can provide:

import en_core_web_sm

tokenized = [
    [
        'artificially_intelligent', 'funds', 'generating', 'excess', 'returns',
        'artificial_intelligence', 'deep_learning', 'compelling', 'reasons',
        'join_us', 'artificially_intelligent', 'fund', 'develop', 'ai',
        'machine_learning', 'capabilities', 'real', 'cases', 'big', 'players',
        'industry', 'discover', 'emerging', 'trends', 'latest_developments',
        'ai', 'machine_learning', 'industry', 'players', 'trading',
        'investing', 'live', 'investment', 'models', 'learn', 'develop',
        'compelling', 'business', 'case', 'clients', 'ceos', 'adopt', 'ai',
        'machine_learning', 'investment', 'approaches', 'rare', 'gathering',
        'talents', 'including', 'quants', 'data_scientists', 'researchers',
        'ai', 'machine_learning', 'experts', 'investment_officers', 'explore',
        'solutions', 'challenges', 'potential', 'risks', 'pitfalls',
        'adopting', 'ai', 'machine_learning'
    ],
    [
        'recent_years', 'topics', 'data_science', 'artificial_intelligence',
        'machine_learning', 'big_data', 'become_increasingly', 'popular',
        'growth', 'fueled', 'collection', 'availability', 'data',
        'continually', 'increasing', 'processing', 'power', 'storage', 'open',
        'source', 'movement', 'making', 'tools', 'widely', 'available',
        'result', 'already', 'witnessed', 'profound', 'changes', 'work',
        'rest', 'play', 'trend', 'increase', 'world', 'finance', 'impacted',
        'investment', 'managers', 'particular', 'join_us', 'explore',
        'data_science', 'means', 'finance_professionals'
    ]
]

nlp = en_core_web_sm.load(disable=['parser', 'ner'])


def lemmatization(descrips, allowed_postags=None):
    if allowed_postags is None:
        allowed_postags = ['NOUN', 'ADJ', 'VERB', 'ADV']
    lemmatized_descrips = []
    for descrip in descrips:
        doc = nlp(" ".join(descrip))
        lemmatized_descrips.append([
            token.lemma_ for token in doc if token.pos_ in allowed_postags
        ])
    return lemmatized_descrips


lemmatized = lemmatization(tokenized)
print(lemmatized)

You will notice that "machine_learne" appears nowhere in the input tokenized, yet both "machine_learning" and "machine_learne" appear in the output lemmatized.

What is causing this, and could it cause problems for other bigrams/trigrams?


Answer:

I think you are misunderstanding how POS tagging and lemmatization work.

POS tagging is based not only on the word itself (I don't know what your mother tongue is, but this is true of many languages) but also on the surrounding words. For example, one commonly learned rule is that in many statements a verb is preceded by a noun, which acts as the verb's subject.
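A quick way to see this context dependence with the same model is to tag an identical word in two different surroundings. This is a minimal sketch; the two example sentences are made up for illustration:

import en_core_web_sm

nlp = en_core_web_sm.load(disable=['parser', 'ner'])

# The same surface form can receive a different POS tag (and therefore a
# different lemma) depending on the words around it.
for text in ["I am learning Spanish", "machine learning is fun"]:
    doc = nlp(text)
    print([(token.text, token.pos_, token.lemma_) for token in doc])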

When you pass all of these "tokens" to your lemmatizer, spaCy's lemmatizer will try to "guess" the POS category of each individual word.

In many cases it goes with a default noun; if the word is not in a lookup table of common and irregular nouns, it tries to apply generic rules (such as stripping the plural "s").
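For example (a minimal sketch, reusing the nlp object loaded above; the sentence is made up for illustration), an irregular plural such as "mice" is resolved through the exception lookup, while a regular plural such as "funds" falls through to the generic rule:

# "mice" is covered by the irregular-noun lookup, while "funds" only
# needs the generic 's'-stripping rule.
doc = nlp("The mice chased the funds")
print([(token.text, token.pos_, token.lemma_) for token in doc])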

In other cases it goes with a default verb based on certain patterns (such as an "-ing" ending), which is probably what is happening here. Since the verb "machine_learning" does not exist in any dictionary (there is no instance of it in the model), it takes the "otherwise" route and applies generic rules.

As a result, machine_learning is probably being lemmatized by a generic "ing"-to-"e" rule (as in making -> make, baking -> bake) that covers many regular verbs.
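You can watch this happen by printing the tag next to the lemma for that one token. This is a minimal sketch, reusing the nlp object and the tokenized list from the question:

# Print the POS tag and the lemma for every occurrence of
# 'machine_learning' in the first document; per the explanation above,
# the 'machine_learne' lemma should line up with the VERB-tagged hits.
doc = nlp(" ".join(tokenized[0]))
for token in doc:
    if token.text == 'machine_learning':
        print(token.pos_, token.lemma_)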

Look at this test example:

for descrip in tokenized:
    doc = nlp(" ".join(descrip))
    print([
        (token.pos_, token.text) for token in doc
    ])

Output:

[('NOUN', 'artificially_intelligent'), ('NOUN', 'funds'), ('VERB', 'generating'), ('ADJ', 'excess'), ('NOUN', 'returns'), ('NOUN', 'artificial_intelligence'), ('NOUN', 'deep_learning'), ('ADJ', 'compelling'), ('NOUN', 'reasons'), ('PROPN', 'join_us'), ('NOUN', 'artificially_intelligent'), ('NOUN', 'fund'), ('NOUN', 'develop'), ('VERB', 'ai'), ('VERB', 'machine_learning'), ('NOUN', 'capabilities'), ('ADJ', 'real'), ('NOUN', 'cases'), ('ADJ', 'big'), ('NOUN', 'players'), ('NOUN', 'industry'), ('VERB', 'discover'), ('VERB', 'emerging'), ('NOUN', 'trends'), ('NOUN', 'latest_developments'), ('VERB', 'ai'), ('VERB', 'machine_learning'), ('NOUN', 'industry'), ('NOUN', 'players'), ('NOUN', 'trading'), ('VERB', 'investing'), ('ADJ', 'live'), ('NOUN', 'investment'), ('NOUN', 'models'), ('VERB', 'learn'), ('VERB', 'develop'), ('ADJ', 'compelling'), ('NOUN', 'business'), ('NOUN', 'case'), ('NOUN', 'clients'), ('NOUN', 'ceos'), ('VERB', 'adopt'), ('VERB', 'ai'), ('ADJ', 'machine_learning'), ('NOUN', 'investment'), ('NOUN', 'approaches'), ('ADJ', 'rare'), ('VERB', 'gathering'), ('NOUN', 'talents'), ('VERB', 'including'), ('NOUN', 'quants'), ('NOUN', 'data_scientists'), ('NOUN', 'researchers'), ('VERB', 'ai'), ('ADJ', 'machine_learning'), ('NOUN', 'experts'), ('NOUN', 'investment_officers'), ('VERB', 'explore'), ('NOUN', 'solutions'), ('VERB', 'challenges'), ('ADJ', 'potential'), ('NOUN', 'risks'), ('NOUN', 'pitfalls'), ('VERB', 'adopting'), ('VERB', 'ai'), ('NOUN', 'machine_learning')]

You can see that, depending on the context, machine_learning is tagged both as a verb and as a noun. But note that simply concatenating the words trips up the tagger, because they are not arranged in the order natural language would lead it to expect.

Even a human could not read this text and POS-tag it correctly:

artificially_intelligent funds generating excess returns artificial_intelligence deep_learning compelling reasons join_us artificially_intelligent fund develop ai machine_learning capabilities real cases big players industry discover emerging trends latest_developments ai machine_learning industry players trading investing live investment models learn develop compelling business case clients ceos adopt ai machine_learning investment approaches rare gathering talents including quants data_scientists researchers ai machine_learning experts investment_officers explore solutions challenges potential risks pitfalls adopting ai machine_learning
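For contrast, here is a minimal sketch with a made-up but well-formed sentence, so you can compare how the same bigram is handled when the tagger has natural context to work with (again reusing the nlp object from above):

# The same bigram inside an ordinary sentence: compare the tag and lemma
# it gets here with the ones it got in the concatenated text above.
doc = nlp("We want to adopt machine_learning in our investment models")
print([(token.text, token.pos_, token.lemma_) for token in doc])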
