I want to train a language model with Python's NLTK, but I've run into a few problems. First, I don't understand why my words turn into single characters when I write the following code:
s = "Natural-language processing (NLP) is an area of computer science " \
    "and artificial intelligence concerned with the interactions " \
    "between computers and human (natural) languages."
s = s.lower()
paddedLine = pad_both_ends(word_tokenize(s), n=2)
train, vocab = padded_everygram_pipeline(2, paddedLine)
print(list(vocab))
lm = MLE(2)
lm.fit(train, vocab)
The printed vocabulary is clearly wrong (I don't want to work with characters!). Here is part of the output:
<s>', '<', 's', '>', '</s>', '<s>', 'n', 'a', 't', 'u', 'r', 'a', 'l', '-', 'l', 'a', 'n', 'g', 'u', 'a', 'g', 'e', '</s>', '<s>', 'p', 'r', 'o', 'c', 'e', 's', 's', 'i', 'n', 'g', '</s>', '<s>', '(', '</s>', '<s>', 'n', 'l', 'p', '</s>', '<s>', ')', '</s>'
Why is my input being split into characters? I tried another approach, but it didn't work either:
paddedLine = pad_both_ends(word_tokenize(s), n=2)
#train, vocab = padded_everygram_pipeline(2, tokens)
#train = everygrams(paddedLine, max_len=2)
train = ngrams(paddedLine, 2)
vocab = Vocabulary(paddedLine, unk_cutoff=1)
print(list(train))
lm = MLE(2)
lm.fit(train, vocab)
When I run this code, my training data is completely empty! It prints "[]"! Strangely, when I comment out this line in the code above:
vocab = Vocabulary(paddedLine, unk_cutoff=1)
the training data comes out correctly, like this:
[('<s>', 'natural-language'), ('natural-language', 'processing'), ('processing', '('), ('(', 'nlp'), ('nlp', ')'), (')', 'is'), ('is', 'an'), ('an', 'area'), ('area', 'of'), ('of', 'computer'), ('computer', 'science'), ('science', 'and'), ('and', 'artificial'), ('artificial', 'intelligence'), ('intelligence', 'concerned'), ('concerned', 'with'), ('with', 'the'), ('the', 'interactions'), ('interactions', 'between'), ('between', 'computers'), ('computers', 'and'), ('and', 'human'), ('human', '('), ('(', 'natural'), ('natural', ')'), (')', 'languages'), ('languages', '.'), ('.', '</s>')]
What on earth is going on here? By the way, I'm not an expert in Python or NLTK; this is my first attempt. My next question is: how can I apply Kneser-Ney smoothing or add-one smoothing to the trained language model? And am I training the language model correctly in the first place? My training data is simply:
"Natural-language processing (NLP) is an area of computer science " \ "and artificial intelligence concerned with the interactions " \ "between computers and human (natural) languages."
Thanks.
Answer:
The padded_everygram_pipeline function expects a list of tokenized sentences, i.e. a list of lists of strings. You passed it a flat token sequence, so it treated each word as a "sentence" and each character as a token, which is why your vocabulary came out as single characters. Wrap your tokenized sentence in an outer list, as in the corrected snippet below. Also note that Python generators are lazy, single-use sequences: once something iterates over one (for example, building a Vocabulary from it), it is exhausted and yields nothing on the next pass, which is why your second snippet printed an empty training set.
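The single-use behavior can be seen in isolation. A minimal sketch (the sentence text is made up) of what happened in your second snippet: Vocabulary drains the padded-token generator, leaving nothing for the bigrams to read.

```python
from nltk.lm import Vocabulary
from nltk.lm.preprocessing import pad_both_ends
from nltk.util import ngrams

# pad_both_ends returns a lazy generator over the tokens.
tokens = pad_both_ends("computers process language".split(), n=2)
bigrams = ngrams(tokens, 2)               # also lazy; nothing consumed yet
vocab = Vocabulary(tokens, unk_cutoff=1)  # this iteration drains the generator
drained = list(bigrams)
print(drained)  # -> []

# Materializing with list() gives a reusable sequence instead.
tokens = list(pad_both_ends("computers process language".split(), n=2))
reusable = list(ngrams(tokens, 2))
print(reusable)
```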
from nltk import word_tokenize
from nltk.lm import MLE
from nltk.lm.preprocessing import pad_both_ends, padded_everygram_pipeline

s = "Natural-language processing (NLP) is an area of computer science " \
    "and artificial intelligence concerned with the interactions " \
    "between computers and human (natural) languages."
s = s.lower()
paddedLine = [list(pad_both_ends(word_tokenize(s), n=2))]
train, vocab = padded_everygram_pipeline(2, paddedLine)
lm = MLE(2)
lm.fit(train, vocab)
print(lm.counts)
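As for the smoothing question: nltk.lm also provides Laplace (add-one) and KneserNeyInterpolated models that share MLE's fit() interface, so you can swap them in directly. A minimal sketch, assuming a made-up two-sentence toy corpus and using str.split() instead of word_tokenize so the example needs no tokenizer data:

```python
from nltk.lm import Laplace, KneserNeyInterpolated
from nltk.lm.preprocessing import padded_everygram_pipeline

# Toy corpus (made up); each sentence is a list of tokens.
sentences = [
    "natural-language processing is an area of computer science".split(),
    "computers interact with human natural languages".split(),
]

# Add-one (Laplace) smoothing: same fit() interface as MLE.
train, vocab = padded_everygram_pipeline(2, sentences)
lap = Laplace(2)
lap.fit(train, vocab)
lap_score = lap.score("processing", ["natural-language"])

# Interpolated Kneser-Ney: regenerate the pipeline first, because the
# generators it returns are single-use and were drained by the fit above.
train, vocab = padded_everygram_pipeline(2, sentences)
kn = KneserNeyInterpolated(2)
kn.fit(train, vocab)
kn_score = kn.score("processing", ["natural-language"])

print(lap_score, kn_score)
```

Both calls return a conditional probability P(word | context), so the results fall between 0 and 1; unlike MLE, they assign non-zero probability to unseen bigrams.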