I am trying to run different regression models on the Prostate cancer data from the lasso2 package. With the Lasso, I have seen two different ways to calculate the mean squared error, but they give me quite different results, so I would like to know whether I am doing something wrong or whether this just means one method is better than the other?
# Needs the following R packages.
library(lasso2)
library(glmnet)

# Gets the prostate cancer dataset.
data(Prostate)

# Defines the mean squared error function.
mse = function(x, y) { mean((x - y)^2) }

# 75% of the sample size.
smp_size = floor(0.75 * nrow(Prostate))

# Sets the seed to make the partition reproducible.
set.seed(907)
train_ind = sample(seq_len(nrow(Prostate)), size = smp_size)

# Training set.
train = Prostate[train_ind, ]

# Test set.
test = Prostate[-train_ind, ]

# Creates matrices for independent and dependent variables.
xtrain = model.matrix(lpsa ~ . - 1, data = train)
ytrain = train$lpsa
xtest = model.matrix(lpsa ~ . - 1, data = test)
ytest = test$lpsa

# Fits a linear model by Lasso regression on the "train" data set.
pr.lasso = cv.glmnet(xtrain, ytrain, type.measure = 'mse', alpha = 1)
lambda.lasso = pr.lasso$lambda.min

# Gets predictions on the "test" data set.
lasso.pred = predict(pr.lasso, s = lambda.lasso, newx = xtest)

# Method 1: MSE via the mse function defined above, on the test set.
mse.1 = mse(lasso.pred, ytest)
cat("MSE (method 1): ", mse.1, "\n")

# Method 2: MSE via the cvm component of the pr.lasso object.
mse.2 = pr.lasso$cvm[pr.lasso$lambda == lambda.lasso]
cat("MSE (method 2): ", mse.2, "\n")
Here is the output, the two MSEs I got:
MSE (method 1):  0.4609978
MSE (method 2):  0.5654089
They are quite different. Does anyone know why? Thanks a lot for your help!
Samuel
Answer:
As @alistaire pointed out, in the first case you are using the test data to compute the MSE, while in the second case the reported MSE comes from the cross-validation (training) folds, so it is not a fair comparison.
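To make that concrete: cvm in the cv.glmnet object is the curve of cross-validated errors computed on the training folds only (one value per lambda), and lambda.min is by construction the lambda at which that curve is minimal. A quick sanity check, reusing the pr.lasso object fitted above:

# cvm is the cross-validated MSE curve from the training folds; the test
# set never enters it. lambda.min is the lambda that minimizes this curve,
# so indexing cvm at lambda.min just recovers its minimum.
all.equal(pr.lasso$cvm[pr.lasso$lambda == pr.lasso$lambda.min],
          min(pr.lasso$cvm))
# should print TRUE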
We can do something like the following to make a fair comparison (by keeping the fitted values on the training folds), and as we can see, mse.1 and mse.2 are exactly equal when both are computed on the same training folds (although the value is slightly different from yours, with my desktop R version 3.1.2, x86_64-w64-mingw32, Windows 10):
# Needs the following R packages.
library(lasso2)
library(glmnet)

# Gets the prostate cancer dataset.
data(Prostate)

# Defines the mean squared error function.
mse = function(x, y) { mean((x - y)^2) }

# 75% of the sample size.
smp_size = floor(0.75 * nrow(Prostate))

# Sets the seed to make the partition reproducible.
set.seed(907)
train_ind = sample(seq_len(nrow(Prostate)), size = smp_size)

# Training set.
train = Prostate[train_ind, ]

# Test set.
test = Prostate[-train_ind, ]

# Creates matrices for independent and dependent variables.
xtrain = model.matrix(lpsa ~ . - 1, data = train)
ytrain = train$lpsa
xtest = model.matrix(lpsa ~ . - 1, data = test)
ytest = test$lpsa

# Fits a linear model by Lasso regression on the "train" data set and
# keeps the prevalidated fitted values on the training folds (keep = TRUE).
pr.lasso = cv.glmnet(xtrain, ytrain, type.measure = 'mse', keep = TRUE, alpha = 1)
lambda.lasso = pr.lasso$lambda.min
lambda.id <- which(pr.lasso$lambda == pr.lasso$lambda.min)

# Gets the predicted values on the training folds at lambda.min
# (fit.preval, not predictions on the test data).
mse.1 = mse(pr.lasso$fit.preval[, lambda.id], ytrain)
cat("MSE (method 1): ", mse.1, "\n")
MSE (method 1):  0.6044496 

# Calculates the MSE via the cvm component of the pr.lasso object.
mse.2 = pr.lasso$cvm[pr.lasso$lambda == lambda.lasso]
cat("MSE (method 2): ", mse.2, "\n")
MSE (method 2):  0.6044496 

mse.1 == mse.2
[1] TRUE
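One follow-up note: if what you ultimately want to report is the out-of-sample error, then your original method 1 (predicting on xtest at lambda.min) is the right quantity; cvm is an error estimate built from the training data alone. As an aside, more recent glmnet releases (3.0 and later, so not the version used above) ship an assess.glmnet() helper that computes such test-set measures directly; a minimal sketch, assuming one of those versions is installed:

# Test-set performance of the cross-validated lasso fit at lambda.min.
# assess.glmnet() only exists in glmnet >= 3.0; for a gaussian response it
# returns a list of measures including $mse and $mae.
perf = assess.glmnet(pr.lasso, newx = xtest, newy = ytest, s = "lambda.min")
perf$mse  # should match mse(predict(pr.lasso, s = "lambda.min", newx = xtest), ytest)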