How to combine the training and test datasets into the same format

I am practicing with this dataset: http://archive.ics.uci.edu/ml/datasets/Census+Income

I have loaded the training and test data.

# Download the training and test data
trainFile = "adult.data"; testFile = "adult.test"
if (!file.exists(trainFile))
  download.file(url = "http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
                destfile = trainFile)
if (!file.exists(testFile))
  download.file(url = "http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test",
                destfile = testFile)

# Assign column names
colNames = c("age", "workclass", "fnlwgt", "education",
             "educationnum", "maritalstatus", "occupation",
             "relationship", "race", "sex", "capitalgain",
             "capitalloss", "hoursperweek", "nativecountry",
             "incomelevel")

# Read the training data
training = read.table(trainFile, header = FALSE, sep = ",",
                      strip.white = TRUE, col.names = colNames,
                      na.strings = "?", stringsAsFactors = TRUE)

# Load the test dataset
testing = read.table(testFile, header = FALSE, sep = ",",
                     strip.white = TRUE, col.names = colNames,
                     na.strings = "?", fill = TRUE, stringsAsFactors = TRUE)

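I need to merge these two datasets into one. There is a problem, though: the two data frames do not have the same structure.

Structure of the training data: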

> str(training)
'data.frame': 32561 obs. of 15 variables:
 $ age          : int 39 50 38 53 28 37 49 52 31 42 ...
 $ workclass    : Factor w/ 8 levels "Federal-gov",..: 7 6 4 4 4 4 4 6 4 4 ...
 $ fnlwgt       : int 77516 83311 215646 234721 338409 284582 160187 209642 45781 159449 ...
 $ education    : Factor w/ 16 levels "10th","11th",..: 10 10 12 2 10 13 7 12 13 10 ...
 $ educationnum : int 13 13 9 7 13 14 5 9 14 13 ...
 $ maritalstatus: Factor w/ 7 levels "Divorced","Married-AF-spouse",..: 5 3 1 3 3 3 4 3 5 3 ...
 $ occupation   : Factor w/ 14 levels "Adm-clerical",..: 1 4 6 6 10 4 8 4 10 4 ...
 $ relationship : Factor w/ 6 levels "Husband","Not-in-family",..: 2 1 2 1 6 6 2 1 2 1 ...
 $ race         : Factor w/ 5 levels "Amer-Indian-Eskimo",..: 5 5 5 3 3 5 3 5 5 5 ...
 $ sex          : Factor w/ 2 levels "Female","Male": 2 2 2 2 1 1 1 2 1 2 ...
 $ capitalgain  : int 2174 0 0 0 0 0 0 0 14084 5178 ...
 $ capitalloss  : int 0 0 0 0 0 0 0 0 0 0 ...
 $ hoursperweek : int 40 13 40 40 40 40 16 45 50 40 ...
 $ nativecountry: Factor w/ 41 levels "Cambodia","Canada",..: 39 39 39 39 5 39 23 39 39 39 ...
 $ incomelevel  : Factor w/ 2 levels "<=50K",">50K": 1 1 1 1 1 1 1 2 2 2 ...

Structure of the test data:

> str(testing)
'data.frame': 16282 obs. of 15 variables:
 $ age          : Factor w/ 74 levels "|1x3 Cross validator",..: 1 10 23 13 29 3 19 14 48 9 ...
 $ workclass    : Factor w/ 9 levels "","Federal-gov",..: 1 5 5 3 5 NA 5 NA 7 5 ...
 $ fnlwgt       : int NA 226802 89814 336951 160323 103497 198693 227026 104626 369667 ...
 $ education    : Factor w/ 17 levels "","10th","11th",..: 1 3 13 9 17 17 2 13 16 17 ...
 $ educationnum : int NA 7 9 12 10 10 6 9 15 10 ...
 $ maritalstatus: Factor w/ 8 levels "","Divorced",..: 1 6 4 4 4 6 6 6 4 6 ...
 $ occupation   : Factor w/ 15 levels "","Adm-clerical",..: 1 8 6 12 8 NA 9 NA 11 9 ...
 $ relationship : Factor w/ 7 levels "","Husband","Not-in-family",..: 1 5 2 2 2 5 3 6 2 6 ...
 $ race         : Factor w/ 6 levels "","Amer-Indian-Eskimo",..: 1 4 6 6 4 6 6 4 6 6 ...
 $ sex          : Factor w/ 3 levels "","Female","Male": 1 3 3 3 3 2 3 3 3 2 ...
 $ capitalgain  : int NA 0 0 0 7688 0 0 0 3103 0 ...
 $ capitalloss  : int NA 0 0 0 0 0 0 0 0 0 ...
 $ hoursperweek : int NA 40 50 40 40 30 30 40 32 40 ...
 $ nativecountry: Factor w/ 41 levels "","Cambodia",..: 1 39 39 39 39 39 39 39 39 39 ...
 $ incomelevel  : Factor w/ 3 levels "","<=50K.",">50K.": 1 2 2 3 3 2 2 2 3 2 ...

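Problem 1:

age has become a factor in the test data, and every factor in the test data has one more level than the same factor in the training data. This is because the first line of the test data is an unwanted row: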

|1x3 Cross validator

I tried to fix this by reassigning the test data:

testing = testing[-1,]

However, after re-running str() I did not see any change.

Problem 2:

As I said earlier, I need to merge these two data frames into one. So I ran the following code:

combined <- rbind(training , testing)

Besides problem 1, running str() revealed a new problem:

> str(combined)
'data.frame':   48842 obs. of  15 variables:
 $ age          : chr  "39" "50" "38" "53" ...
 $ workclass    : Factor w/ 9 levels "Federal-gov",..: 7 6 4 4 4 4 4 6 4 4 ...
 $ fnlwgt       : int  77516 83311 215646 234721 338409 284582 160187 209642 45781 159449 ...
 $ education    : Factor w/ 17 levels "10th","11th",..: 10 10 12 2 10 13 7 12 13 10 ...
 $ educationnum : int  13 13 9 7 13 14 5 9 14 13 ...
 $ maritalstatus: Factor w/ 8 levels "Divorced","Married-AF-spouse",..: 5 3 1 3 3 3 4 3 5 3 ...
 $ occupation   : Factor w/ 15 levels "Adm-clerical",..: 1 4 6 6 10 4 8 4 10 4 ...
 $ relationship : Factor w/ 7 levels "Husband","Not-in-family",..: 2 1 2 1 6 6 2 1 2 1 ...
 $ race         : Factor w/ 6 levels "Amer-Indian-Eskimo",..: 5 5 5 3 3 5 3 5 5 5 ...
 $ sex          : Factor w/ 3 levels "Female","Male",..: 2 2 2 2 1 1 1 2 1 2 ...
 $ capitalgain  : int  2174 0 0 0 0 0 0 0 14084 5178 ...
 $ capitalloss  : int  0 0 0 0 0 0 0 0 0 0 ...
 $ hoursperweek : int  40 13 40 40 40 40 16 45 50 40 ...
 $ nativecountry: Factor w/ 42 levels "Cambodia","Canada",..: 39 39 39 39 5 39 23 39 39 39 ...
 $ incomelevel  : Factor w/ 5 levels "<=50K",">50K",..: 1 1 1 1 1 1 1 2 2 2 ...

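In the combined data frame, the target variable (incomelevel) has 5 factor levels, whereas it has 2 in the training data frame (which is correct) and 3 in the test data frame (one extra because of problem 1). This is because every incomelevel value in the test data frame is followed by a . (dot) (<=50K., <=50K., >50K., ...). So I need to strip that dot, but I don't know how. Is there a function for this?

I am very new to data and R, hence this basic question. Can you help me solve the problems I have run into?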


Answer:

I think you can ignore the first line of the test data, since it looks like a header. That fixes age being read as a factor:

head(readLines(testFile))
[1] "|1x3 Cross validator"
[2] "25, Private, 226802, 11th, 7, Never-married, Machine-op-inspct, Own-child, Black, Male, 0, 0, 40, United-States, <=50K."
[3] "38, Private, 89814, HS-grad, 9, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 50, United-States, <=50K."

Re-running your code, we can use read.csv, with skip=1 for the test data:

colNames = c("age", "workclass", "fnlwgt", "education",
             "educationnum", "maritalstatus", "occupation",
             "relationship", "race", "sex", "capitalgain",
             "capitalloss", "hoursperweek", "nativecountry",
             "incomelevel")

# Read the training data
training = read.csv(trainFile, header = FALSE, col.names = colNames,
                    stringsAsFactors = TRUE, na.strings = "?",
                    strip.white = TRUE)

# Read the test data, skipping the unwanted first line
testing = read.csv(testFile, header = FALSE, col.names = colNames,
                   na.strings = "?", stringsAsFactors = TRUE,
                   skip = 1, strip.white = TRUE)
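To see why dropping the first row after loading did not change str(), while skip = 1 does, here is a minimal sketch on a three-column toy sample. The inline text and the names raw, cols, bad, dropped, and good are made up for illustration; only the behaviour of read.csv and of factors is the point.

```r
# Toy stand-in for adult.test: a junk first line followed by two records.
raw <- paste(c("|1x3 Cross validator",
               "25, Private, <=50K.",
               "38, Private, >50K."), collapse = "\n")
cols <- c("age", "workclass", "incomelevel")

# Reading the junk line as data forces age to be parsed as text, hence a factor.
bad <- read.csv(text = raw, header = FALSE, col.names = cols,
                strip.white = TRUE, stringsAsFactors = TRUE)

# Removing the row afterwards keeps the column type and its unused levels.
dropped <- bad[-1, ]
class(dropped$age)      # still "factor"
levels(dropped$age)     # still contains the junk string

# Skipping the line at read time lets age be parsed as integer.
good <- read.csv(text = raw, header = FALSE, col.names = cols,
                 strip.white = TRUE, stringsAsFactors = TRUE, skip = 1)
class(good$age)         # "integer"
```

If the data is already loaded, droplevels() removes the unused levels and as.integer(as.character(x)) converts the column, but skipping the line at read time avoids both steps.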

As for incomelevel, unfortunately we have to correct it manually. Good thing you checked:

testing$incomelevel = factor(gsub("\\.","",as.character(testing$incomelevel)))
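A slightly narrower variant, as a sketch: gsub("\\.", "", x) deletes every dot in the string, which is fine here because the only dots are the trailing ones. Anchoring the pattern with $ and renaming the levels in place restricts the change to a final dot. The toy vector below is illustrative and stands in for the real column.

```r
# Toy stand-in for testing$incomelevel.
incomelevel <- factor(c("<=50K.", ">50K.", "<=50K."))

# Rename the levels only: "\\.$" matches a single dot at the end of the string.
levels(incomelevel) <- sub("\\.$", "", levels(incomelevel))

levels(incomelevel)
table(incomelevel)
```

Assigning to levels() touches only the handful of level labels rather than converting the whole column to character and back.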

Checking the levels, the only difference is in nativecountry:

all.equal(sapply(testing,levels) ,sapply(training,levels))
[1] "Component “nativecountry”: Lengths (40, 41) differ (string compare on first 40)"
[2] "Component “nativecountry”: 26 string mismatches"

I don't think there is much you can do about that; you will probably have to remove it before or after combining:

setdiff(levels(training$nativecountry),levels(testing$nativecountry))
[1] "Holand-Netherlands"
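One alternative to removing the level, offered as a sketch rather than as part of the answer above: give both columns the same level set before combining. The two small data frames and the names train_toy, test_toy, and all_levels are invented for illustration. Note that rbind on data frames already expands factor levels as needed, so this step may matter mostly when a later modelling step expects identical levels in both sets.

```r
# Toy stand-ins: the training factor has one level the test factor lacks.
train_toy <- data.frame(
  nativecountry = factor(c("Canada", "Holand-Netherlands", "United-States"))
)
test_toy <- data.frame(
  nativecountry = factor(c("United-States", "Canada"))
)

# Build the union of both level sets and re-create each factor with it.
all_levels <- union(levels(train_toy$nativecountry),
                    levels(test_toy$nativecountry))
train_toy$nativecountry <- factor(train_toy$nativecountry, levels = all_levels)
test_toy$nativecountry  <- factor(test_toy$nativecountry,  levels = all_levels)

# Both factors now share the same levels, so the structures line up.
identical(levels(train_toy$nativecountry), levels(test_toy$nativecountry))

combined_toy <- rbind(train_toy, test_toy)
str(combined_toy)
```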
