Appending a column as a factor to a data frame in R produces NA in the appended column

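I am learning R and trying to learn machine learning with the caret package.

Problem: after creating dummies and removing the NZV variables, when I add the Y variable back to the data frame as a factor, the new column is filled with NA (see steps 5-6). How can I keep the Y variable as a factor in the final data frame?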

1. Data (bank marketing response data from UCI/Kaggle)

str(data)
'data.frame':   4119 obs. of  21 variables:
 $ age           : int  30 39 25 38 47 32 32 41 31 35 ...
 $ job           : Factor w/ 12 levels "admin.","blue-collar",..: 2 8 8 8 1 8 1 3 8 2 ...
 $ marital       : Factor w/ 4 levels "divorced","married",..: 2 3 2 2 2 3 3 2 1 2 ...
 $ education     : Factor w/ 8 levels "basic.4y","basic.6y",..: 3 4 4 3 7 7 7 7 6 3 ...
 $ default       : Factor w/ 3 levels "no","unknown",..: 1 1 1 1 1 1 1 2 1 2 ...
 $ housing       : Factor w/ 3 levels "no","unknown",..: 3 1 3 2 3 1 3 3 1 1 ...
 $ loan          : Factor w/ 3 levels "no","unknown",..: 1 1 1 2 1 1 1 1 1 1 ...
 $ contact       : Factor w/ 2 levels "cellular","telephone": 1 2 2 2 1 1 1 1 1 2 ...
 $ month         : Factor w/ 10 levels "apr","aug","dec",..: 7 7 5 5 8 10 10 8 8 7 ...
 $ day_of_week   : Factor w/ 5 levels "fri","mon","thu",..: 1 1 5 1 2 3 2 2 4 3 ...
 $ duration      : int  487 346 227 17 58 128 290 44 68 170 ...
 $ campaign      : int  2 4 1 3 1 3 4 2 1 1 ...
 $ pdays         : int  999 999 999 999 999 999 999 999 999 999 ...
 $ previous      : int  0 0 0 0 0 2 0 0 1 0 ...
 $ poutcome      : Factor w/ 3 levels "failure","nonexistent",..: 2 2 2 2 2 1 2 2 1 2 ...
 $ emp.var.rate  : num  -1.8 1.1 1.4 1.4 -0.1 -1.1 -1.1 -0.1 -0.1 1.1 ...
 $ cons.price.idx: num  92.9 94 94.5 94.5 93.2 ...
 $ cons.conf.idx : num  -46.2 -36.4 -41.8 -41.8 -42 -37.5 -37.5 -42 -42 -36.4 ...
 $ euribor3m     : num  1.31 4.86 4.96 4.96 4.19 ...
 $ nr.employed   : num  5099 5191 5228 5228 5196 ...
 $ y             : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...

2. Save the X and Y variables

Y = subset(data, select = y)
X = subset(data, select = -y)
dim(X)
dim(Y)
[1] 4119   20
[1] 4119    1

3. Create dummies

library(caret)  # provides dummyVars(), nearZeroVar(), rfe()
pp_dummy <- dummyVars(y ~ ., data = data)
data <- predict(pp_dummy, newdata = data)
data <- data.frame(data)

4. Remove variables with near-zero variance

library(dplyr)  # provides %>%, mutate() and pull()
nzv_list <- nearZeroVar(data) %>%
  as.vector()
data <- data[, -nzv_list]
str(data)
'data.frame':   4119 obs. of  44 variables:
 $ age                          : num  30 39 25 38 47 32 32 41 31 35 ...
 $ job.admin.                   : num  0 0 0 0 1 0 1 0 0 0 ...
 $ job.blue.collar              : num  1 0 0 0 0 0 0 0 0 1 ...
 $ job.management               : num  0 0 0 0 0 0 0 0 0 0 ...
 $ job.services                 : num  0 1 1 1 0 1 0 0 1 0 ...
 $ job.technician               : num  0 0 0 0 0 0 0 0 0 0 ...
 $ marital.divorced             : num  0 0 0 0 0 0 0 0 1 0 ...
 $ marital.married              : num  1 0 1 1 1 0 0 1 0 1 ...
 $ marital.single               : num  0 1 0 0 0 1 1 0 0 0 ...
 $ education.basic.4y           : num  0 0 0 0 0 0 0 0 0 0 ...
 $ education.basic.6y           : num  0 0 0 0 0 0 0 0 0 0 ...
 $ education.basic.9y           : num  1 0 0 1 0 0 0 0 0 1 ...
 $ education.high.school        : num  0 1 1 0 0 0 0 0 0 0 ...
 $ education.professional.course: num  0 0 0 0 0 0 0 0 1 0 ...
 $ education.university.degree  : num  0 0 0 0 1 1 1 1 0 0 ...
 $ default.no                   : num  1 1 1 1 1 1 1 0 1 0 ...
 $ default.unknown              : num  0 0 0 0 0 0 0 1 0 1 ...
 $ housing.no                   : num  0 1 0 0 0 1 0 0 1 1 ...
 $ housing.yes                  : num  1 0 1 0 1 0 1 1 0 0 ...
 $ loan.no                      : num  1 1 1 0 1 1 1 1 1 1 ...
 $ loan.yes                     : num  0 0 0 0 0 0 0 0 0 0 ...
 $ contact.cellular             : num  1 0 0 0 1 1 1 1 1 0 ...
 $ contact.telephone            : num  0 1 1 1 0 0 0 0 0 1 ...
 $ month.apr                    : num  0 0 0 0 0 0 0 0 0 0 ...
 $ month.aug                    : num  0 0 0 0 0 0 0 0 0 0 ...
 $ month.jul                    : num  0 0 0 0 0 0 0 0 0 0 ...
 $ month.jun                    : num  0 0 1 1 0 0 0 0 0 0 ...
 $ month.may                    : num  1 1 0 0 0 0 0 0 0 1 ...
 $ month.nov                    : num  0 0 0 0 1 0 0 1 1 0 ...
 $ day_of_week.fri              : num  1 1 0 1 0 0 0 0 0 0 ...
 $ day_of_week.mon              : num  0 0 0 0 1 0 1 1 0 0 ...
 $ day_of_week.thu              : num  0 0 0 0 0 1 0 0 0 1 ...
 $ day_of_week.tue              : num  0 0 0 0 0 0 0 0 1 0 ...
 $ day_of_week.wed              : num  0 0 1 0 0 0 0 0 0 0 ...
 $ duration                     : num  487 346 227 17 58 128 290 44 68 170 ...
 $ campaign                     : num  2 4 1 3 1 3 4 2 1 1 ...
 $ previous                     : num  0 0 0 0 0 2 0 0 1 0 ...
 $ poutcome.failure             : num  0 0 0 0 0 1 0 0 1 0 ...
 $ poutcome.nonexistent         : num  1 1 1 1 1 0 1 1 0 1 ...
 $ emp.var.rate                 : num  -1.8 1.1 1.4 1.4 -0.1 -1.1 -1.1 -0.1 -0.1 1.1 ...
 $ cons.price.idx               : num  92.9 94 94.5 94.5 93.2 ...
 $ cons.conf.idx                : num  -46.2 -36.4 -41.8 -41.8 -42 -37.5 -37.5 -42 -42 -36.4 ...
 $ euribor3m                    : num  1.31 4.86 4.96 4.96 4.19 ...
 $ nr.employed                  : num  5099 5191 5228 5228 5196 ...

5. Problem: adding y to data as a factor produces NA in the column

data$y <- as.factor(Y)
str(data)
'data.frame':   4119 obs. of  45 variables:
 $ age                          : num  30 39 25 38 47 32 32 41 31 35 ...
 $ job.admin.                   : num  0 0 0 0 1 0 1 0 0 0 ...
 $ job.blue.collar              : num  1 0 0 0 0 0 0 0 0 1 ...
 $ job.management               : num  0 0 0 0 0 0 0 0 0 0 ...
 $ job.services                 : num  0 1 1 1 0 1 0 0 1 0 ...
 $ job.technician               : num  0 0 0 0 0 0 0 0 0 0 ...
 $ marital.divorced             : num  0 0 0 0 0 0 0 0 1 0 ...
 $ marital.married              : num  1 0 1 1 1 0 0 1 0 1 ...
 $ marital.single               : num  0 1 0 0 0 1 1 0 0 0 ...
 $ education.basic.4y           : num  0 0 0 0 0 0 0 0 0 0 ...
 $ education.basic.6y           : num  0 0 0 0 0 0 0 0 0 0 ...
 $ education.basic.9y           : num  1 0 0 1 0 0 0 0 0 1 ...
 $ education.high.school        : num  0 1 1 0 0 0 0 0 0 0 ...
 $ education.professional.course: num  0 0 0 0 0 0 0 0 1 0 ...
 $ education.university.degree  : num  0 0 0 0 1 1 1 1 0 0 ...
 $ default.no                   : num  1 1 1 1 1 1 1 0 1 0 ...
 $ default.unknown              : num  0 0 0 0 0 0 0 1 0 1 ...
 $ housing.no                   : num  0 1 0 0 0 1 0 0 1 1 ...
 $ housing.yes                  : num  1 0 1 0 1 0 1 1 0 0 ...
 $ loan.no                      : num  1 1 1 0 1 1 1 1 1 1 ...
 $ loan.yes                     : num  0 0 0 0 0 0 0 0 0 0 ...
 $ contact.cellular             : num  1 0 0 0 1 1 1 1 1 0 ...
 $ contact.telephone            : num  0 1 1 1 0 0 0 0 0 1 ...
 $ month.apr                    : num  0 0 0 0 0 0 0 0 0 0 ...
 $ month.aug                    : num  0 0 0 0 0 0 0 0 0 0 ...
 $ month.jul                    : num  0 0 0 0 0 0 0 0 0 0 ...
 $ month.jun                    : num  0 0 1 1 0 0 0 0 0 0 ...
 $ month.may                    : num  1 1 0 0 0 0 0 0 0 1 ...
 $ month.nov                    : num  0 0 0 0 1 0 0 1 1 0 ...
 $ day_of_week.fri              : num  1 1 0 1 0 0 0 0 0 0 ...
 $ day_of_week.mon              : num  0 0 0 0 1 0 1 1 0 0 ...
 $ day_of_week.thu              : num  0 0 0 0 0 1 0 0 0 1 ...
 $ day_of_week.tue              : num  0 0 0 0 0 0 0 0 1 0 ...
 $ day_of_week.wed              : num  0 0 1 0 0 0 0 0 0 0 ...
 $ duration                     : num  487 346 227 17 58 128 290 44 68 170 ...
 $ campaign                     : num  2 4 1 3 1 3 4 2 1 1 ...
 $ previous                     : num  0 0 0 0 0 2 0 0 1 0 ...
 $ poutcome.failure             : num  0 0 0 0 0 1 0 0 1 0 ...
 $ poutcome.nonexistent         : num  1 1 1 1 1 0 1 1 0 1 ...
 $ emp.var.rate                 : num  -1.8 1.1 1.4 1.4 -0.1 -1.1 -1.1 -0.1 -0.1 1.1 ...
 $ cons.price.idx               : num  92.9 94 94.5 94.5 93.2 ...
 $ cons.conf.idx                : num  -46.2 -36.4 -41.8 -41.8 -42 -37.5 -37.5 -42 -42 -36.4 ...
 $ euribor3m                    : num  1.31 4.86 4.96 4.96 4.19 ...
 $ nr.employed                  : num  5099 5191 5228 5228 5196 ...
 $ y                            : Factor w/ 1 level "1:2": NA NA NA NA NA NA NA NA NA NA ...

6. If I add Y directly, it does not produce NA right away, but as soon as I convert it to a factor, it produces NA

data$y <- Y # as.factor(Y)
data <- data %>% mutate(y = as.factor(y))
str(data)

(Update)

7. If I do not convert it to a factor, I always have to use pull(data$y) instead of data$y directly. Here is an example:

subsets <- c(7, 10, 12, 15, 20)
control <- rfeControl(functions = rfFuncs, method = "cv", verbose = FALSE)
system.time(
  RFE_res <- rfe(x = data[, 1:44],    # subset(train, select = -y)
                 y = pull(data$y),
                 sizes = subsets,
                 rfeControl = control)
)

How can I avoid pull(data$y) and use data$y directly?


Answer:

This has nothing to do with pull().
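As a rough illustration of why pull() only looked necessary in steps 6 and 7, the sketch below repeats the assignment on the built-in iris data. The names df and class_before are made up for this example and do not come from the question. It assumes only base R.

```r
# Toy stand-in for the question's objects, built from the iris data set.
Y  <- subset(iris, select = Species)    # one-column data frame, not a vector
df <- subset(iris, select = -Species)   # the remaining four columns

# Same assignment as step 6: the whole data frame is stored inside one column.
df$y <- Y
class_before <- class(df$y)
class_before       # "data.frame": df$y is itself a data frame
class(df$y[[1]])   # "factor": the actual values sit one level down

# Assigning the column itself gives an ordinary factor column.
df$y <- Y$Species
class(df$y)        # "factor"
```

With a nested data frame in df$y, functions that expect a vector receive a data frame instead, which is presumably why pull() was needed to unwrap it.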

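as.factor() cannot turn a data frame into a factor, even when the data frame has only one column, and subset(..., select = ) returns a data frame rather than a vector:

X = subset(iris,select=-Species)
Y = subset(iris,select=Species)
as.factor(Y)
Species
   <NA>
Levels: 1:3
.valid.factor(Y)
[1] "factor levels must be \"character\""
levels(Y)
NULL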

You need to refer to the column of the data frame:

X$y = as.factor(Y$Species)
# or X %>% mutate(y = as.factor(Y$Species))
str(X)
'data.frame':   150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ y           : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
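For reference, here is a minimal sketch of a few base R ways to get the column out as a vector before adding it back. The names y1, y2 and y3 are illustrative only, and iris stands in for the bank marketing data, so treat this as one possible approach rather than the only one.

```r
Y <- subset(iris, select = Species)   # one-column data frame, as in the question

# Three ways to obtain the column itself (here already a factor).
y1 <- Y$Species                                     # by name
y2 <- Y[[1]]                                        # by position
y3 <- subset(iris, select = Species, drop = TRUE)   # drop = TRUE returns the vector

# All three should hold the same factor.
stopifnot(is.factor(y1), identical(y1, y2), identical(y1, y3))

# Adding the vector back gives a regular factor column without NA.
X <- subset(iris, select = -Species)
X$y <- y1
levels(X$y)
anyNA(X$y)   # FALSE
```

In the question's code, the equivalent change would be data$y <- Y$y, or extracting y as a vector in step 2, after which data$y can be passed to rfe() directly.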
