I want to train a language model with Python's NLTK, but I've run into a few problems. First, I don't understand why my words turn into single characters when I write the following code:
s = "Natural-language processing (NLP) is an area of computer science " \
    "and artificial intelligence concerned with the interactions " \
    "between computers and human (natural) languages."
s = s.lower()
paddedLine = pad_both_ends(word_tokenize(s), n=2)
train, vocab = padded_everygram_pipeline(2, paddedLine)
print(list(vocab))
lm = MLE(2)
lm.fit(train, vocab)
The printed vocabulary is clearly wrong (I don't want to work with characters!). Here is part of the output:
<s>', '<', 's', '>', '</s>', '<s>', 'n', 'a', 't', 'u', 'r', 'a', 'l', '-', 'l', 'a', 'n', 'g', 'u', 'a', 'g', 'e', '</s>', '<s>', 'p', 'r', 'o', 'c', 'e', 's', 's', 'i', 'n', 'g', '</s>', '<s>', '(', '</s>', '<s>', 'n', 'l', 'p', '</s>', '<s>', ')', '</s>'
Why is my input being split into characters? I tried another approach, but it didn't work either:
paddedLine = pad_both_ends(word_tokenize(s), n=2)
#train, vocab = padded_everygram_pipeline(2, tokens)
#train = everygrams(paddedLine, max_len=2)
train = ngrams(paddedLine, 2)
vocab = Vocabulary(paddedLine, unk_cutoff=1)
print(list(train))
lm = MLE(2)
lm.fit(train, vocab)
When I run this code, my training data is completely empty! It prints "[]"! Strangely, when I comment out this line in the code above:
vocab = Vocabulary(paddedLine, unk_cutoff=1)
the training data comes out correctly, like this:
[('<s>', 'natural-language'), ('natural-language', 'processing'), ('processing', '('), ('(', 'nlp'), ('nlp', ')'), (')', 'is'), ('is', 'an'), ('an', 'area'), ('area', 'of'), ('of', 'computer'), ('computer', 'science'), ('science', 'and'), ('and', 'artificial'), ('artificial', 'intelligence'), ('intelligence', 'concerned'), ('concerned', 'with'), ('with', 'the'), ('the', 'interactions'), ('interactions', 'between'), ('between', 'computers'), ('computers', 'and'), ('and', 'human'), ('human', '('), ('(', 'natural'), ('natural', ')'), (')', 'languages'), ('languages', '.'), ('.', '</s>')]
What on earth is going on here? By the way, I'm not an expert in Python or NLTK; this is my first attempt. My next question is: how can I apply Kneser-Ney smoothing or add-one smoothing to the trained language model? And am I training the language model correctly in the first place? My training data is simply:
"Natural-language processing (NLP) is an area of computer science " \ "and artificial intelligence concerned with the interactions " \ "between computers and human (natural) languages."
Thanks.
Answer:
The padded_everygram_pipeline function expects a list of tokenized sentences, i.e. a list of lists of strings. You passed it a flat token sequence, so it treated each word as a "sentence" and each character as a token, which is why your vocabulary came out as single characters. Wrap your tokenized sentence in an outer list, as in the corrected snippet below. Also note that Python generators are lazy, single-use sequences: once something iterates over one (for example, building a Vocabulary from it), it is exhausted and yields nothing on the next pass, which is why your second snippet printed an empty training set.
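The single-use behavior can be seen in isolation. A minimal sketch (the sentence text is made up) of what happened in your second snippet: Vocabulary drains the padded-token generator, leaving nothing for the bigrams to read.

```python
from nltk.lm import Vocabulary
from nltk.lm.preprocessing import pad_both_ends
from nltk.util import ngrams

# pad_both_ends returns a lazy generator over the tokens.
tokens = pad_both_ends("computers process language".split(), n=2)
bigrams = ngrams(tokens, 2)               # also lazy; nothing consumed yet
vocab = Vocabulary(tokens, unk_cutoff=1)  # this iteration drains the generator
drained = list(bigrams)
print(drained)  # -> []

# Materializing with list() gives a reusable sequence instead.
tokens = list(pad_both_ends("computers process language".split(), n=2))
reusable = list(ngrams(tokens, 2))
print(reusable)
```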
from nltk import word_tokenize
from nltk.lm import MLE
from nltk.lm.preprocessing import pad_both_ends, padded_everygram_pipeline

s = "Natural-language processing (NLP) is an area of computer science " \
    "and artificial intelligence concerned with the interactions " \
    "between computers and human (natural) languages."
s = s.lower()
paddedLine = [list(pad_both_ends(word_tokenize(s), n=2))]
train, vocab = padded_everygram_pipeline(2, paddedLine)
lm = MLE(2)
lm.fit(train, vocab)
print(lm.counts)
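As for the smoothing question: nltk.lm also provides Laplace (add-one) and KneserNeyInterpolated models that share MLE's fit() interface, so you can swap them in directly. A minimal sketch, assuming a made-up two-sentence toy corpus and using str.split() instead of word_tokenize so the example needs no tokenizer data:

```python
from nltk.lm import Laplace, KneserNeyInterpolated
from nltk.lm.preprocessing import padded_everygram_pipeline

# Toy corpus (made up); each sentence is a list of tokens.
sentences = [
    "natural-language processing is an area of computer science".split(),
    "computers interact with human natural languages".split(),
]

# Add-one (Laplace) smoothing: same fit() interface as MLE.
train, vocab = padded_everygram_pipeline(2, sentences)
lap = Laplace(2)
lap.fit(train, vocab)
lap_score = lap.score("processing", ["natural-language"])

# Interpolated Kneser-Ney: regenerate the pipeline first, because the
# generators it returns are single-use and were drained by the fit above.
train, vocab = padded_everygram_pipeline(2, sentences)
kn = KneserNeyInterpolated(2)
kn.fit(train, vocab)
kn_score = kn.score("processing", ["natural-language"])

print(lap_score, kn_score)
```

Both calls return a conditional probability P(word | context), so the results fall between 0 and 1; unlike MLE, they assign non-zero probability to unseen bigrams.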