NLTK perplexity measure inverted

I have a training text and a test text. All I want to do is train a language model on the training data and use it to compute the perplexity of the test data.

Here is my code:

import os
import requests
import io #codecs
from nltk.util import everygrams
from nltk.lm.preprocessing import pad_both_ends
from nltk import word_tokenize, sent_tokenize

fileTest = open("AaronPressman.txt","r");
with io.open('AaronPressman.txt', encoding='utf8') as fin:
    textTest = fin.read()
if os.path.isfile('AaronPressmanEdited.txt'):
    with io.open('AaronPressmanEdited.txt', encoding='utf8') as fin:
        text = fin.read()

# Tokenize the text.
tokenized_text = [list(map(str.lower, word_tokenize(sent)))
                  for sent in sent_tokenize(text)]

from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.lm import Laplace

n = 1
padded_bigrams = list(pad_both_ends(word_tokenize(textTest), n=1))
trainTest = everygrams(padded_bigrams, min_len=n, max_len=n);

train_data, padded_sents = padded_everygram_pipeline(n, tokenized_text)

model = Laplace(n)
model.fit(train_data, padded_sents)
print(model.perplexity(trainTest))

When I run this code with n=1 (a unigram model), I get "1068.332393940235". With n=2 (a bigram model), I get "1644.3441077259993", and with a trigram model I get 2552.2085752565313.

What is the problem here?


Answer:

The way you are creating the test data is wrong (the training data is lowercased, but the test data is not; the start and end padding tokens are also missing from the test data). Try this:

import os
import io
from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.lm import Laplace
from nltk import word_tokenize, sent_tokenize

"""
fileTest = open("AaronPressman.txt","r");
with io.open('AaronPressman.txt', encoding='utf8') as fin:
    textTest = fin.read()
if os.path.isfile('AaronPressmanEdited.txt'):
    with io.open('AaronPressmanEdited.txt', encoding='utf8') as fin:
        text = fin.read()
"""
textTest = "This is an ant. This is a cat"
text = "This is an orange. This is a mango"
n = 2

# Tokenize and lowercase the training text, sentence by sentence.
tokenized_text = [list(map(str.lower, word_tokenize(sent)))
                  for sent in sent_tokenize(text)]
train_data, padded_sents = padded_everygram_pipeline(n, tokenized_text)

# Preprocess the test text the same way, so it is lowercased and padded too.
tokenized_test = [list(map(str.lower, word_tokenize(sent)))
                  for sent in sent_tokenize(textTest)]
test_data, _ = padded_everygram_pipeline(n, tokenized_test)

model = Laplace(n)
model.fit(train_data, padded_sents)

# Average the per-sentence perplexities over the test data.
s = 0
for i, test in enumerate(test_data):
    p = model.perplexity(test)
    s += p
print("Perplexity: {0}".format(s / (i + 1)))
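For reference, here is a minimal sketch of how the same fix could be applied back to the file-based setup from the question (assuming, as in the original code, that AaronPressman.txt is the test text and AaronPressmanEdited.txt is the training text). Both texts go through the same lowercasing and padded_everygram_pipeline preprocessing, so the test n-grams carry the padding tokens and the vocabulary is built from the training text only:

import io
from nltk import word_tokenize, sent_tokenize
from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.lm import Laplace

n = 2  # order of the model

# Read the training and test texts (file names taken from the question).
with io.open('AaronPressmanEdited.txt', encoding='utf8') as fin:
    train_text = fin.read()
with io.open('AaronPressman.txt', encoding='utf8') as fin:
    test_text = fin.read()

def preprocess(text):
    # Lowercase and tokenize sentence by sentence, identically for train and test.
    return [list(map(str.lower, word_tokenize(sent)))
            for sent in sent_tokenize(text)]

# Build padded everygrams and the vocabulary from the training text only.
train_data, train_vocab = padded_everygram_pipeline(n, preprocess(train_text))
model = Laplace(n)
model.fit(train_data, train_vocab)

# Build the test everygrams with the same pipeline and average per-sentence perplexity.
test_data, _ = padded_everygram_pipeline(n, preprocess(test_text))
perplexities = [model.perplexity(sent_grams) for sent_grams in test_data]
print("Average perplexity: {0}".format(sum(perplexities) / len(perplexities)))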
