I have some training text and some test text. What I want to do is train a language model on the training data and use it to compute the perplexity of the test data.
Here is my code:
import os
import requests
import io #codecs
from nltk.util import everygrams
from nltk.lm.preprocessing import pad_both_ends
from nltk import word_tokenize, sent_tokenize

fileTest = open("AaronPressman.txt","r");
with io.open('AaronPressman.txt', encoding='utf8') as fin:
    textTest = fin.read()
if os.path.isfile('AaronPressmanEdited.txt'):
    with io.open('AaronPressmanEdited.txt', encoding='utf8') as fin:
        text = fin.read()

# Tokenize the text.
tokenized_text = [list(map(str.lower, word_tokenize(sent)))
                  for sent in sent_tokenize(text)]

from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.lm import Laplace

n = 1
padded_bigrams = list(pad_both_ends(word_tokenize(textTest), n=1))
trainTest = everygrams(padded_bigrams, min_len=n, max_len=n);

train_data, padded_sents = padded_everygram_pipeline(n, tokenized_text)

model = Laplace(n)
model.fit(train_data, padded_sents)

print(model.perplexity(trainTest))
When I run this code with n=1, i.e. a unigram model, I get "1068.332393940235". With n=2, i.e. a bigram model, I get "1644.3441077259993", and with a trigram model I get 2552.2085752565313.
What is the problem here?
Answer:
The way you are creating the test data is wrong: the training data is lowercased, but the test data is not, and the test data is missing the start and end padding tokens. Try this:
import os
import requests
import io #codecs
from nltk.util import everygrams
from nltk.lm.preprocessing import pad_both_ends
from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.lm import Laplace
from nltk import word_tokenize, sent_tokenize

"""
fileTest = open("AaronPressman.txt","r");
with io.open('AaronPressman.txt', encoding='utf8') as fin:
    textTest = fin.read()
if os.path.isfile('AaronPressmanEdited.txt'):
    with io.open('AaronPressmanEdited.txt', encoding='utf8') as fin:
        text = fin.read()
"""
textTest = "This is an ant. This is a cat"
text = "This is an orange. This is a mango"

n = 2

# Tokenize and lowercase the training text.
tokenized_text = [list(map(str.lower, word_tokenize(sent)))
                  for sent in sent_tokenize(text)]
train_data, padded_vocab = padded_everygram_pipeline(n, tokenized_text)

# Preprocess the test text exactly the same way: lowercase, tokenize,
# and pad each sentence with start/end symbols.
tokenized_test = [list(map(str.lower, word_tokenize(sent)))
                  for sent in sent_tokenize(textTest)]
test_data, _ = padded_everygram_pipeline(n, tokenized_test)

model = Laplace(n)  # the model order should match the n used in the pipelines
model.fit(train_data, padded_vocab)

# Average the per-sentence perplexities over the test data.
s = 0
for i, test in enumerate(test_data):
    p = model.perplexity(test)
    s += p

print("Perplexity: {0}".format(s / (i + 1)))
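To see concretely what the missing padding and lowercasing look like, here is a minimal sketch using the same NLTK helpers; the sentence is just one of the toy test sentences from above, shown for illustration:

from nltk import word_tokenize
from nltk.lm.preprocessing import pad_both_ends
from nltk.util import everygrams

# Lowercase and tokenize one test sentence, then pad it for a bigram model (n=2).
tokens = list(map(str.lower, word_tokenize("This is a cat")))
padded = list(pad_both_ends(tokens, n=2))
print(padded)
# ['<s>', 'this', 'is', 'a', 'cat', '</s>']

# These are the bigrams the model is scored on; without the '<s>'/'</s>'
# symbols (and lowercasing) they would not match what the model saw in training.
print(list(everygrams(padded, min_len=2, max_len=2)))
# [('<s>', 'this'), ('this', 'is'), ('is', 'a'), ('a', 'cat'), ('cat', '</s>')]

In the original code, the uppercase tokens and the missing padding symbols do not match the training vocabulary, so they are scored as unknown words, which is what inflates the perplexity.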