I want to make predictions from surnames, e.g. distinguishing Chinese from non-Chinese names. In particular, I want to extract three-letter substrings from each surname. For example, the surname "gao" should yield one feature, "gao", while "chan" should yield two features, "cha" and "han".
The splitting itself already works in the three_split function below. But as far as I can tell, to use these as feature sets I need to return the output as a dictionary. Any hints on how to do that? For "Chan", the dictionary should return "cha" and "han" as TRUE.
from nltk.classify import PositiveNaiveBayesClassifier
import re

chinese_names = ['gao', 'chan', 'chen', 'Tsai', 'liu', 'Lee']

nonchinese_names = ['silva', 'anderson', 'kidd', 'bryant', 'Jones', 'harris', 'davis']

def three_split(word):
    word = word.lower()
    word = word.replace(" ", "_")
    split = 3
    return [word[start:start+split] for start in range(0, len(word)-2)]

positive_featuresets = list(map(three_split, chinese_names))
unlabeled_featuresets = list(map(three_split, nonchinese_names))

classifier = PositiveNaiveBayesClassifier.train(positive_featuresets,
    unlabeled_featuresets)

print three_split("Jim Silva")
print classifier.classify(three_split("Jim Silva"))
Answer:
Here's a white-box answer:
Using your original code, it outputs:
Traceback (most recent call last):
  File "test.py", line 17, in <module>
    unlabeled_featuresets)
  File "/usr/local/lib/python2.7/dist-packages/nltk/classify/positivenaivebayes.py", line 108, in train
    for fname, fval in featureset.items():
AttributeError: 'list' object has no attribute 'items'
Looking at line 17:
classifier = PositiveNaiveBayesClassifier.train(positive_featuresets, unlabeled_featuresets)
It seems that PositiveNaiveBayesClassifier requires an object that has an .items() method; intuitively, if the NLTK code is Pythonic, that object should be a dict.
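To make that concrete, here's a minimal sketch (my illustration, not NLTK code) of the featureset shape that would satisfy train(): each example is a plain dict, so that .items() yields (feature_name, value) pairs, exactly the loop the traceback shows.

featureset = {'cha': True, 'han': True}  # the desired dict for 'chan'
print featureset.items()                 # [('cha', True), ('han', True)] in Python 2 (order may vary)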
Looking at https://github.com/nltk/nltk/blob/develop/nltk/classify/positivenaivebayes.py#L88, there's no clear explanation of what the positive_featuresets parameter should contain:

:param positive_featuresets: A list of featuresets that are known as
    positive examples (i.e., their label is True).
Checking the docstring, we see this example:
Example:

    >>> from nltk.classify import PositiveNaiveBayesClassifier

Some sentences about sports:

    >>> sports_sentences = [ 'The team dominated the game',
    ...                      'They lost the ball',
    ...                      'The game was intense',
    ...                      'The goalkeeper catched the ball',
    ...                      'The other team controlled the ball' ]

Mixed topics, including sports:

    >>> various_sentences = [ 'The President did not comment',
    ...                       'I lost the keys',
    ...                       'The team won the game',
    ...                       'Sara has two kids',
    ...                       'The ball went off the court',
    ...                       'They had the ball for the whole game',
    ...                       'The show is over' ]

The features of a sentence are simply the words it contains:

    >>> def features(sentence):
    ...     words = sentence.lower().split()
    ...     return dict(('contains(%s)' % w, True) for w in words)

We use the sports sentences as positive examples, the mixed ones as unlabeled examples:

    >>> positive_featuresets = list(map(features, sports_sentences))
    >>> unlabeled_featuresets = list(map(features, various_sentences))
    >>> classifier = PositiveNaiveBayesClassifier.train(positive_featuresets,
    ...                                                 unlabeled_featuresets)
Now we've found the features() function that converts a sentence into features, returning

dict(('contains(%s)' % w, True) for w in words)

which is basically the thing whose .items() can be called. Looking at the dict comprehension, the 'contains(%s)' % w wrapping seems a bit redundant unless it's for human readability, so you could simply use dict((w, True) for w in words).
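For instance, a stripped-down version of that docstring function with the wrapper dropped (an illustrative rewrite, not NLTK's own code) would be:

def features(sentence):
    # same featureset as before, minus the 'contains(%s)' label
    words = sentence.lower().split()
    return dict((w, True) for w in words)

print features('The team dominated the game')
# e.g. {'the': True, 'team': True, 'dominated': True, 'game': True} (key order may vary)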
Also, replacing whitespace with an underscore might be redundant too, unless there's a use for it later on.
Lastly, the slicing and bounded iteration can be replaced with the ngrams function, which extracts character ngrams, e.g.:
>>> word = 'alexgao'
>>> split = 3
>>> [word[start:start+split] for start in range(0, len(word)-2)]
['ale', 'lex', 'exg', 'xga', 'gao']
# With ngrams
>>> from nltk.util import ngrams
>>> ["".join(ng) for ng in ngrams(word, 3)]
['ale', 'lex', 'exg', 'xga', 'gao']
Your feature extraction function could be simplified as such:
from nltk.util import ngrams

def three_split(word):
    return dict(("".join(ng), True) for ng in ngrams(word.lower(), 3))
[out]:
{'im ': True, 'm s': True, 'jim': True, 'ilv': True, ' si': True, 'lva': True, 'sil': True}
False
In fact, NLTK classifiers are so versatile that you can use tuples of characters as features, so there's no need to paste the ngrams back together when extracting features, i.e.:
from nltk.classify import PositiveNaiveBayesClassifier
import re
from nltk.util import ngrams

chinese_names = ['gao', 'chan', 'chen', 'Tsai', 'liu', 'Lee']

nonchinese_names = ['silva', 'anderson', 'kidd', 'bryant', 'Jones', 'harris', 'davis']

def three_split(word):
    return dict(((ng, True) for ng in ngrams(word.lower(), 3)))

positive_featuresets = list(map(three_split, chinese_names))
unlabeled_featuresets = list(map(three_split, nonchinese_names))

classifier = PositiveNaiveBayesClassifier.train(positive_featuresets,
    unlabeled_featuresets)

print three_split("Jim Silva")
print classifier.classify(three_split("Jim Silva"))
[out]:
{('m', ' ', 's'): True, ('j', 'i', 'm'): True, ('s', 'i', 'l'): True, ('i', 'l', 'v'): True, (' ', 's', 'i'): True, ('l', 'v', 'a'): True, ('i', 'm', ' '): True}
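As a quick usage check (my addition; this isn't in the original output), the same tuple-based three_split can be applied to one of the training names and the classifier queried the same way:

print three_split("chan")
# {('c', 'h', 'a'): True, ('h', 'a', 'n'): True}  (key order may vary)
print classifier.classify(three_split("chan"))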