Using lasso regularization with factor and numeric predictors?

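I have a dataset and would like to use the lasso for feature selection. I am new to R and am currently following guides I found online. The data is stored in a data frame; the target variable has been removed from it and is stored separately in a single-column data frame. This is a regression problem and the target is numeric. Here is the code I am trying to run: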

library(glmnet)
lasso_model <- cv.glmnet(
    x = as.matrix(train),
    y = train_target,
    alpha = 1)

Here is the structure of the dataset:

'data.frame':   9798 obs. of  55 variables:
 $ acres: num  0.186 2.991 0.144 0.218 0.173 ...
 $ above: int  1754 3030 1531 834 1022 1528 768 1184 2026 3176 ...
 $ basement: int  0 1811 500 440 0 476 0 0 732 0 ...
 $ baths: Factor w/ 7 levels "0","1","2","3",..: 3 4 3 3 2 3 2 2 3 3 ...
 $ toilets: Factor w/ 5 levels "0","1","2","3",..: 1 3 2 1 1 2 1 1 2 2 ...
 $ fireplaces: Factor w/ 6 levels "0","1","2","3",..: 2 2 2 2 1 1 1 2 2 2 ...
 $ beds: Factor w/ 7 levels "1","2","3","4",..: 4 5 2 2 2 3 2 2 3 5 ...
 $ rooms: Factor w/ 15 levels "0","1","2","3",..: 5 5 5 4 5 3 3 3 4 6 ...
 $ age: int  103 17 13 46 116 12 93 93 42 100 ...
 $ yearsfromsale: Factor w/ 3 levels "2","3","4": 2 2 2 1 2 2 3 3 1 1 ...
 $ car: Factor w/ 4 levels "0","1","2","3": 1 4 3 1 1 3 1 1 4 1 ...
 $ city_DES.MOINES: Factor w/ 2 levels "0","1": 2 1 1 2 2 1 2 2 2 2 ...
 $ city_JOHNSTON: Factor w/ 2 levels "0","1": 1 2 2 1 1 1 1 1 1 1 ...
 $ city_WEST.DES.MOINES: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ city_CLIVE: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ city_URBANDALE: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ city_ALTOONA: Factor w/ 2 levels "0","1": 1 1 1 1 1 2 1 1 1 1 ...
 $ city_BONDURANT: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ city_CROCKER.TWNSHP: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ city_GRIMES: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ city_POLK.CITY: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ city_PLEASANT.HILL: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ city_WINDSOR.HEIGHTS: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ zip_50315: Factor w/ 2 levels "0","1": 1 1 1 2 1 1 1 1 1 1 ...
 $ zip_50321: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 2 1 ...
 $ zip_50320: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ zip_50312: Factor w/ 2 levels "0","1": 2 1 1 1 1 1 1 1 1 2 ...
 $ zip_50314: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ zip_50311: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ zip_50309: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ zip_50316: Factor w/ 2 levels "0","1": 1 1 1 1 2 1 1 2 1 1 ...
 $ zip_50317: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ zip_50313: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 1 1 1 ...
 $ zip_50310: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ zip_50322: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ zip_50131: Factor w/ 2 levels "0","1": 1 1 2 1 1 1 1 1 1 1 ...
 $ zip_50111: Factor w/ 2 levels "0","1": 1 2 1 1 1 1 1 1 1 1 ...
 $ zip_50265: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ zip_50266: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ zip_50325: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ zip_50323: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ zip_50009: Factor w/ 2 levels "0","1": 1 1 1 1 1 2 1 1 1 1 ...
 $ zip_50035: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ zip_50023: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ zip_50226: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ zip_50021: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ zip_50327: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ zip_50324: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ walkout_0: Factor w/ 2 levels "0","1": 2 1 2 2 2 1 2 2 2 2 ...
 $ walkout_1: Factor w/ 2 levels "0","1": 1 2 1 1 1 2 1 1 1 1 ...
 $ condition_Normal: Factor w/ 2 levels "0","1": 1 2 2 1 1 2 1 1 1 1 ...
 $ condition_Above.Normal: Factor w/ 2 levels "0","1": 2 1 1 2 2 1 2 1 1 2 ...
 $ condition_Below.Normal: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 2 1 ...
 $ AC_1: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 1 ...

When I try to run the lasso_model line, I get the following error:

Error in cbind2(1, newx) %*% nbeta : invalid class 'NA' to  dup_mMatrix_as_dgeMatrix

Ultimately, I want to determine which variables can be removed. Any help would be much appreciated!


Answer:

OK, here is my strong suspicion.

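Your data frame contains factors. When a data frame has any non-numeric column, as.matrix converts everything to character strings rather than numbers, and glmnet does not know what to do with a character matrix: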

> df <- data.frame(a=as.factor(c('0', '1', '2')), b=as.factor(c('0', '0', '1')))
> df
  a b
1 0 0
2 1 0
3 2 1
> as.matrix(df)
     a   b
[1,] "0" "0"
[2,] "1" "0"
[3,] "2" "1"

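Try converting them back to numbers explicitly (going through as.character is roundabout, but it should work, because as.numeric applied directly to a factor returns the internal level codes rather than the labels):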

> as.matrix(data.frame(lapply(df, function(x) as.numeric(as.character(x)))))
     a b
[1,] 0 0
[2,] 1 0
[3,] 2 1
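The conversion above treats multi-level factors such as baths or rooms as plain numbers. If you would rather keep them categorical, one alternative (my suggestion, not something the original guide prescribes) is to let model.matrix expand each factor into 0/1 dummy columns. The sketch below uses a simulated data frame and a simulated target in place of train and train_target, so the column names and values are hypothetical. It also assumes the target is passed as a plain numeric vector, which for a single-column data frame would be something like train_target[[1]].

```r
set.seed(1)
n <- 120

# Simulated stand-in for `train` (hypothetical columns and values).
df <- data.frame(
  a = factor(rep(c("0", "1", "2"), length.out = n)),
  b = factor(rep(c("0", "1"), length.out = n)),
  n = rnorm(n)
)

# Option 1: turn factor labels into numbers, leaving numeric columns alone.
x_num <- as.matrix(data.frame(lapply(df, function(col) {
  if (is.factor(col)) as.numeric(as.character(col)) else col
})))

# Option 2: one 0/1 dummy column per non-reference factor level.
# [, -1] drops the intercept column, since glmnet fits its own intercept.
x_dummy <- model.matrix(~ ., data = df)[, -1]

# Simulated numeric target vector standing in for train_target[[1]].
y <- 2 * x_dummy[, "a2"] + df$n + rnorm(n)

# Fit the lasso only if glmnet is installed.
if (requireNamespace("glmnet", quietly = TRUE)) {
  fit <- glmnet::cv.glmnet(x = x_dummy, y = y, alpha = 1)
  # Coefficients shrunk exactly to zero are candidates for removal.
  print(coef(fit, s = "lambda.min"))
}
```

With dummy coding the lasso can drop individual factor levels rather than a whole variable, so read the zero coefficients level by level.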
