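I am running LDA on a number of texts. When I generated some visualizations of the resulting topics, I found that the bigram "machine_learning" had been lemmatized both as "machine_learning" and as "machine_learne". Here is the smallest reproducible example I can provide: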
    import en_core_web_sm

    tokenized = [
        [
            'artificially_intelligent', 'funds', 'generating', 'excess', 'returns',
            'artificial_intelligence', 'deep_learning', 'compelling', 'reasons',
            'join_us', 'artificially_intelligent', 'fund', 'develop', 'ai',
            'machine_learning', 'capabilities', 'real', 'cases', 'big', 'players',
            'industry', 'discover', 'emerging', 'trends', 'latest_developments',
            'ai', 'machine_learning', 'industry', 'players', 'trading', 'investing',
            'live', 'investment', 'models', 'learn', 'develop', 'compelling',
            'business', 'case', 'clients', 'ceos', 'adopt', 'ai', 'machine_learning',
            'investment', 'approaches', 'rare', 'gathering', 'talents', 'including',
            'quants', 'data_scientists', 'researchers', 'ai', 'machine_learning',
            'experts', 'investment_officers', 'explore', 'solutions', 'challenges',
            'potential', 'risks', 'pitfalls', 'adopting', 'ai', 'machine_learning'
        ],
        [
            'recent_years', 'topics', 'data_science', 'artificial_intelligence',
            'machine_learning', 'big_data', 'become_increasingly', 'popular',
            'growth', 'fueled', 'collection', 'availability', 'data', 'continually',
            'increasing', 'processing', 'power', 'storage', 'open', 'source',
            'movement', 'making', 'tools', 'widely', 'available', 'result',
            'already', 'witnessed', 'profound', 'changes', 'work', 'rest', 'play',
            'trend', 'increase', 'world', 'finance', 'impacted', 'investment',
            'managers', 'particular', 'join_us', 'explore', 'data_science', 'means',
            'finance_professionals'
        ]
    ]

    nlp = en_core_web_sm.load(disable=['parser', 'ner'])

    def lemmatization(descrips, allowed_postags=None):
        if allowed_postags is None:
            allowed_postags = ['NOUN', 'ADJ', 'VERB', 'ADV']
        lemmatized_descrips = []
        for descrip in descrips:
            doc = nlp(" ".join(descrip))
            lemmatized_descrips.append([
                token.lemma_ for token in doc if token.pos_ in allowed_postags
            ])
        return lemmatized_descrips

    lemmatized = lemmatization(tokenized)
    print(lemmatized)
As you will notice, "machine_learne" appears nowhere in the input tokenized, yet both "machine_learning" and "machine_learne" appear in the output lemmatized.
What is causing this? Can it cause problems with other bigrams/trigrams?
Answer:
I think you have misunderstood how POS tagging and lemmatization work.
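POS tagging is based not only on the word itself but also on the surrounding words (I don't know your native language, but this is common to many languages). For example, one commonly learned rule is that in many statements a verb is preceded by a noun, which is the subject of that verb.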
When you pass all of these "tokens" to the lemmatizer, spaCy's lemmatizer tries to "guess" the part of speech of each isolated word.
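In many cases it falls back to a default of noun, and if the word is not in its lookup table of common and irregular nouns, it tries generic rules (such as stripping a plural "s").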
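In other cases it falls back to a default of verb based on some pattern (such as a trailing "-ing"), which is probably what happened here. Since no dictionary contains a verb "machine_learning" (the model has never seen one), it takes the "else" route and applies generic rules.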
So machine_learning is probably being lemmatized by a generic "ing" to "e" rule (as in making -> make, baking -> bake), which is correct for many regular verbs.
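To make that fallback concrete, here is a minimal sketch of a rule-based lemmatizer. It is a toy: the `KNOWN_WORDS` set, the `SUFFIX_RULES` table, and the `toy_lemma` function are hypothetical names made up for illustration and are not spaCy's actual tables or code. It only shows how an unknown word tagged as a verb can end up with a made-up lemma, while the same word tagged as a noun is left alone.

```python
# Toy illustration only: the word list and suffix rules below are
# hypothetical stand-ins, not spaCy's real lookup tables or algorithm.
KNOWN_WORDS = {"make", "bake", "learn", "fund"}

# (suffix, replacement) pairs tried in order for each coarse POS tag.
SUFFIX_RULES = {
    "VERB": [("ing", "e"), ("ing", "")],
    "NOUN": [("s", "")],
}


def toy_lemma(word, pos):
    """Return a lemma for `word` given its POS tag, using suffix rules."""
    candidates = []
    for suffix, replacement in SUFFIX_RULES.get(pos, []):
        if word.endswith(suffix):
            candidate = word[: len(word) - len(suffix)] + replacement
            if candidate in KNOWN_WORDS:
                # A rule produced a known word, so trust it.
                return candidate
            candidates.append(candidate)
    # No rule produced a known word: fall back to the first rewritten
    # form, or to the word itself if no rule matched at all.
    return candidates[0] if candidates else word


print(toy_lemma("making", "VERB"))            # make
print(toy_lemma("machine_learning", "VERB"))  # machine_learne
print(toy_lemma("machine_learning", "NOUN"))  # machine_learning
```

In this sketch the lemma depends entirely on the tag: the same string gives two different results, which matches the pair of forms seen in the question.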
Look at this test example:
    for descrip in tokenized:
        doc = nlp(" ".join(descrip))
        print([(token.pos_, token.text) for token in doc])
Output:
[('NOUN', 'artificially_intelligent'), ('NOUN', 'funds'), ('VERB', 'generating'), ('ADJ', 'excess'), ('NOUN', 'returns'), ('NOUN', 'artificial_intelligence'), ('NOUN', 'deep_learning'), ('ADJ', 'compelling'), ('NOUN', 'reasons'), ('PROPN', 'join_us'), ('NOUN', 'artificially_intelligent'), ('NOUN', 'fund'), ('NOUN', 'develop'), ('VERB', 'ai'), ('VERB', 'machine_learning'), ('NOUN', 'capabilities'), ('ADJ', 'real'), ('NOUN', 'cases'), ('ADJ', 'big'), ('NOUN', 'players'), ('NOUN', 'industry'), ('VERB', 'discover'), ('VERB', 'emerging'), ('NOUN', 'trends'), ('NOUN', 'latest_developments'), ('VERB', 'ai'), ('VERB', 'machine_learning'), ('NOUN', 'industry'), ('NOUN', 'players'), ('NOUN', 'trading'), ('VERB', 'investing'), ('ADJ', 'live'), ('NOUN', 'investment'), ('NOUN', 'models'), ('VERB', 'learn'), ('VERB', 'develop'), ('ADJ', 'compelling'), ('NOUN', 'business'), ('NOUN', 'case'), ('NOUN', 'clients'), ('NOUN', 'ceos'), ('VERB', 'adopt'), ('VERB', 'ai'), ('ADJ', 'machine_learning'), ('NOUN', 'investment'), ('NOUN', 'approaches'), ('ADJ', 'rare'), ('VERB', 'gathering'), ('NOUN', 'talents'), ('VERB', 'including'), ('NOUN', 'quants'), ('NOUN', 'data_scientists'), ('NOUN', 'researchers'), ('VERB', 'ai'), ('ADJ', 'machine_learning'), ('NOUN', 'experts'), ('NOUN', 'investment_officers'), ('VERB', 'explore'), ('NOUN', 'solutions'), ('VERB', 'challenges'), ('ADJ', 'potential'), ('NOUN', 'risks'), ('NOUN', 'pitfalls'), ('VERB', 'adopting'), ('VERB', 'ai'), ('NOUN', 'machine_learning')]
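You will see that machine_learning is tagged as a verb, an adjective, or a noun depending on its context. Note, though, that simply joining the words together confuses the tagger, because they are no longer in the order that natural language expects.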
Even a human could not make sense of this text and POS-tag it correctly:
artificially_intelligent funds generating excess returns artificial_intelligence deep_learning compelling reasons join_us artificially_intelligent fund develop ai machine_learning capabilities real cases big players industry discover emerging trends latest_developments ai machine_learning industry players trading investing live investment models learn develop compelling business case clients ceos adopt ai machine_learning investment approaches rare gathering talents including quants data_scientists researchers ai machine_learning experts investment_officers explore solutions challenges potential risks pitfalls adopting ai machine_learning
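One possible workaround, sketched here as an assumption and not as something the answer above prescribes, is to leave underscore-joined n-grams untouched and only lemmatize single words. The helper name `lemmas_keeping_ngrams` is hypothetical. It relies only on the `text`, `lemma_` and `pos_` attributes, which spaCy tokens provide, so the example uses `SimpleNamespace` stand-ins in place of a real `Doc` to stay runnable without a model.

```python
from types import SimpleNamespace

ALLOWED_POSTAGS = frozenset({"NOUN", "ADJ", "VERB", "ADV"})


def lemmas_keeping_ngrams(doc, allowed_postags=ALLOWED_POSTAGS):
    """Lemmatize single words, but pass underscore-joined n-grams through.

    `doc` is any iterable of objects exposing `text`, `lemma_` and `pos_`
    (for example a spaCy Doc). N-grams skip the POS filter as well,
    since their tags are unreliable in this setting.
    """
    result = []
    for token in doc:
        if "_" in token.text:
            # Keep the n-gram exactly as the phrase model produced it.
            result.append(token.text)
        elif token.pos_ in allowed_postags:
            result.append(token.lemma_)
    return result


# Stand-in tokens mimicking the attributes of spaCy tokens (assumed values).
fake_doc = [
    SimpleNamespace(text="funds", lemma_="fund", pos_="NOUN"),
    SimpleNamespace(text="machine_learning", lemma_="machine_learne", pos_="VERB"),
    SimpleNamespace(text="join_us", lemma_="join_us", pos_="PROPN"),
    SimpleNamespace(text="the", lemma_="the", pos_="DET"),
]

print(lemmas_keeping_ngrams(fake_doc))
```

With the function from the question, the inner list comprehension could be replaced by `lemmas_keeping_ngrams(doc)`. Note the design choice: n-grams are kept even when their tag falls outside the allowed list, which differs from the original filter. An alternative is to lemmatize the running text first and build the bigrams afterwards.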