I have some training text and some test text. What I want to do is train a language model on the training data and use it to compute the perplexity of the test data.
Here is my code:
import os
import requests
import io #codecs
from nltk.util import everygrams
from nltk.lm.preprocessing import pad_both_ends
from nltk import word_tokenize, sent_tokenize

fileTest = open("AaronPressman.txt","r");
with io.open('AaronPressman.txt', encoding='utf8') as fin:
    textTest = fin.read()
if os.path.isfile('AaronPressmanEdited.txt'):
    with io.open('AaronPressmanEdited.txt', encoding='utf8') as fin:
        text = fin.read()

# Tokenize the text.
tokenized_text = [list(map(str.lower, word_tokenize(sent)))
                  for sent in sent_tokenize(text)]

from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.lm import Laplace

n = 1
padded_bigrams = list(pad_both_ends(word_tokenize(textTest), n=1))
trainTest = everygrams(padded_bigrams, min_len=n, max_len=n);

train_data, padded_sents = padded_everygram_pipeline(n, tokenized_text)

model = Laplace(n)
model.fit(train_data, padded_sents)

print(model.perplexity(trainTest))
When I run this code with n=1, i.e. a unigram model, I get "1068.332393940235". With n=2, i.e. a bigram model, I get "1644.3441077259993", and with a trigram model I get 2552.2085752565313.
What is the problem here?
Answer:
The way you are creating the test data is wrong: the training data is lowercased, but the test data is not, and the test data is missing the start and end padding tokens. Try this:
import os
import requests
import io #codecs
from nltk.util import everygrams
from nltk.lm.preprocessing import pad_both_ends
from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.lm import Laplace
from nltk import word_tokenize, sent_tokenize

"""
fileTest = open("AaronPressman.txt","r");
with io.open('AaronPressman.txt', encoding='utf8') as fin:
    textTest = fin.read()
if os.path.isfile('AaronPressmanEdited.txt'):
    with io.open('AaronPressmanEdited.txt', encoding='utf8') as fin:
        text = fin.read()
"""
textTest = "This is an ant. This is a cat"
text = "This is an orange. This is a mango"

n = 2

# Tokenize and lowercase the training text.
tokenized_text = [list(map(str.lower, word_tokenize(sent)))
                  for sent in sent_tokenize(text)]
train_data, padded_vocab = padded_everygram_pipeline(n, tokenized_text)

# Preprocess the test text exactly the same way: lowercase, tokenize,
# and pad each sentence with start/end symbols.
tokenized_test = [list(map(str.lower, word_tokenize(sent)))
                  for sent in sent_tokenize(textTest)]
test_data, _ = padded_everygram_pipeline(n, tokenized_test)

model = Laplace(n)  # the model order should match the n used in the pipelines
model.fit(train_data, padded_vocab)

# Average the per-sentence perplexities over the test data.
s = 0
for i, test in enumerate(test_data):
    p = model.perplexity(test)
    s += p

print("Perplexity: {0}".format(s / (i + 1)))
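To see concretely what the missing padding and lowercasing look like, here is a minimal sketch using the same NLTK helpers; the sentence is just one of the toy test sentences from above, shown for illustration:

from nltk import word_tokenize
from nltk.lm.preprocessing import pad_both_ends
from nltk.util import everygrams

# Lowercase and tokenize one test sentence, then pad it for a bigram model (n=2).
tokens = list(map(str.lower, word_tokenize("This is a cat")))
padded = list(pad_both_ends(tokens, n=2))
print(padded)
# ['<s>', 'this', 'is', 'a', 'cat', '</s>']

# These are the bigrams the model is scored on; without the '<s>'/'</s>'
# symbols (and lowercasing) they would not match what the model saw in training.
print(list(everygrams(padded, min_len=2, max_len=2)))
# [('<s>', 'this'), ('this', 'is'), ('is', 'a'), ('a', 'cat'), ('cat', '</s>')]

In the original code, the uppercase tokens and the missing padding symbols do not match the training vocabulary, so they are scored as unknown words, which is what inflates the perplexity.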