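For a machine learning project, I want to split my data into a training set and a test set while keeping the proportions of a particular group the same in both. I created a dummy data frame with 40 rows to explain what I need. In it, for the "Region" group, 20% of the rows come from "North America", 50% from "Europe", 20% from "Asia", and 10% from "Oceania". I would like to end up with a random subset, say 25% of the whole data, in which the percentage breakdown of the "Region" group is unchanged.
In other words, I would like to start from the following data: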
        City           County       Region
    1   Shangai        China        Asia
    2   Tokyo          Japan        Asia
    3   Osaka          Japan        Asia
    4   Hanoi          Vietnam      Asia
    5   Beijing        China        Asia
    6   Sapporo        Japan        Asia
    7   Tottori        Japan        Asia
    8   Saigon         Vietnam      Asia
    9   Rome           Italy        Europe
    10  Paris          France       Europe
    11  Lisbon         Portugal     Europe
    12  Berlin         Germany      Europe
    13  Madrid         Spain        Europe
    14  Vienna         Austria      Europe
    15  Naples         Italy        Europe
    16  Nice           France       Europe
    17  Porto          Portugal     Europe
    18  Frankfurt      Germany      Europe
    19  Sevilla        Spain        Europe
    20  Salzburg       Austria      Europe
    21  Barcelona      Spain        Europe
    22  Amsterdam      Netherlands  Europe
    23  Bern           Switzerland  Europe
    24  Milan          Italy        Europe
    25  San Sebastian  Spain        Europe
    26  Rotterdam      Netherlands  Europe
    27  Zurich         Switzerland  Europe
    28  Turin          Italy        Europe
    29  New York City  US           North America
    30  Toronto        Canada       North America
    31  Mexico City    Mexico       North America
    32  Atlanta        US           North America
    33  Chicago        US           North America
    34  Atlanta        US           North America
    35  Vancouver      Canada       North America
    36  Guadalajara    Mexico       North America
    37  Sydney         Australia    Oceania
    38  Wellington     New Zealand  Oceania
    39  Melbourne      Australia    Oceania
    40  Auckland       New Zealand  Oceania
and end up with something like this (it is important to me that the rows are selected at random):
        City           County       Region
    1   New York       US           North America
    2   Mexico City    Mexico       North America
    3   Amsterdam      Netherlands  Europe
    4   Madrid         Spain        Europe
    5   Lisbon         Portugal     Europe
    6   Rome           Italy        Europe
    7   Paris          France       Europe
    8   Tokyo          Japan        Asia
    9   Osaka          Japan        Asia
    10  Wellington     New Zealand  Oceania
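Answer:
The createDataPartition() function from the caret package can be used to assign observations to training and test groups while preserving the percentage distribution of each category of the split variable. We'll illustrate its use with the AlzheimerDisease data from Applied Predictive Modeling.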
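    library(caret)
    library(AppliedPredictiveModeling)
    set.seed(90125)

    # Loads `diagnosis` (the outcome factor) and `predictors` into the workspace
    data(AlzheimerDisease)
    adData = data.frame(diagnosis, predictors)

    # Row indices for a 60% training sample, stratified on diagnosis
    inTrain = createDataPartition(adData$diagnosis, p = .6)[[1]]
    training = adData[ inTrain,]
    testing = adData[-inTrain,]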
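Now we'll tabulate the dependent variable in each data frame. The ratio of Impaired to Control is just under 0.38 in both, which means Impaired makes up about 27% of each data frame.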
    > table(training$diagnosis)

    Impaired  Control 
          55      146 
    > table(testing$diagnosis)

    Impaired  Control 
          36       96 
    > 55/146
    [1] 0.3767123
    > 36/96
    [1] 0.375
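The divisions above give the ratio of Impaired to Control rather than each class's share of the rows. As a small sketch of how the shares could be checked, the counts from the output can be passed to base R's prop.table(); the variable names here are made up for illustration and are not part of the original answer.

```r
# Counts copied from the table() output above
train_counts <- c(Impaired = 55, Control = 146)
test_counts  <- c(Impaired = 36, Control = 96)

# prop.table() divides each count by the total, so the shares sum to 1
train_share <- prop.table(train_counts)
test_share  <- prop.table(test_counts)

round(train_share, 3)
round(test_share, 3)
```

Both data frames come out at roughly 27% Impaired, so the split preserved the class balance. With the real objects, prop.table(table(training$diagnosis)) should give the same figures directly.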
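Using the data from the original post
If we draw a 75% sample from the data provided in the question, we can split it into a training data frame of 30 rows and a test data frame of 10 rows.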
    library(caret)  # repeated so this block runs on its own

    # OP data
    textFile <- "id|City|County|Region
    1|Shangai|China|Asia
    2|Tokyo|Japan|Asia
    3|Osaka|Japan|Asia
    4|Hanoi|Vietnam|Asia
    5|Beijing|China|Asia
    6|Sapporo|Japan|Asia
    7|Tottori|Japan|Asia
    8|Saigon|Vietnam|Asia
    9|Rome|Italy|Europe
    10|Paris|France|Europe
    11|Lisbon|Portugal|Europe
    12|Berlin|Germany|Europe
    13|Madrid|Spain|Europe
    14|Vienna|Austria|Europe
    15|Naples|Italy|Europe
    16|Nice|France|Europe
    17|Porto|Portugal|Europe
    18|Frankfurt|Germany|Europe
    19|Sevilla|Spain|Europe
    20|Salzbourg|Austria|Europe
    21|Barcelona|Spain|Europe
    22|Amsterdam|Netherlands|Europe
    23|Bern|Switzerland|Europe
    24|Milan|Italy|Europe
    25|SanSebastian|Spain|Europe
    26|Rotterdam|Netherlands|Europe
    27|Zurich|Switzerland|Europe
    28|Turin|Italy|Europe
    29|New York City|US|North America
    30|Toronto|Canada|North America
    31|Mexico City|Mexico|North America
    32|Atlanta|US|North America
    33|Chicago|US|North America
    34|Atlanta|US|North America
    35|Vancouver|Canada|North America
    36|Guadalajara|Mexico|North America
    37|Syndey|Australia|Oceania
    38|Wellington|New Zealand|Oceania
    39|Melbourn|Australia|Oceania
    40|Auckland|New Zealand|Oceania"

    # Parse the pipe-delimited text into a data frame
    data <- read.table(text = textFile, header = TRUE, sep = "|",
                       stringsAsFactors = FALSE)

    set.seed(901250)
    # 75% of the rows go to training, stratified on Region; the rest are the test set
    inTrain = createDataPartition(data$Region, p = .75)[[1]]
    training = data[ inTrain,]
    testing = data[-inTrain,]
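When we print the table for the test data, we see that the distribution of Region matches what the question asked for: 20% Asia, 50% Europe, 20% North America, 10% Oceania.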
    > table(testing$Region)

             Asia        Europe North America       Oceania 
                2             5             2             1 
Finally, we'll print the testing data frame.
    > testing
       id        City      County        Region
    2   2       Tokyo       Japan          Asia
    8   8      Saigon     Vietnam          Asia
    9   9        Rome       Italy        Europe
    17 17       Porto    Portugal        Europe
    19 19     Sevilla       Spain        Europe
    21 21   Barcelona       Spain        Europe
    22 22   Amsterdam Netherlands        Europe
    32 32     Atlanta          US North America
    36 36 Guadalajara      Mexico North America
    38 38  Wellington New Zealand       Oceania
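For readers who would rather not depend on caret, the same idea can be sketched in base R by sampling within each group. This is only an illustration under stated assumptions, not what createDataPartition() does internally: the data frame `df`, the names `test_base` and `train_base`, and the seed are all invented here, and `df` mimics only the Region counts from the question (8, 20, 8, 4).

```r
# Hypothetical stand-in for the question's data: only the Region column matters here
regions <- rep(c("Asia", "Europe", "North America", "Oceania"),
               times = c(8, 20, 8, 4))
df <- data.frame(id = seq_along(regions), Region = regions,
                 stringsAsFactors = FALSE)

set.seed(42)  # arbitrary seed, used only to make the draw reproducible

# Group the row indices by Region, then draw 25% of each group at random
rows_by_region <- split(seq_len(nrow(df)), df$Region)
test_idx <- unlist(lapply(rows_by_region, function(idx) {
  n_take <- round(0.25 * length(idx))
  # sample.int() picks positions, which avoids sample()'s behaviour
  # of treating a length-one vector as a range
  idx[sample.int(length(idx), size = n_take)]
}))

test_base  <- df[sort(test_idx), ]
train_base <- df[-test_idx, ]

table(test_base$Region)
```

Because every group size here is a multiple of four, each region contributes exactly a quarter of its rows (2, 5, 2, and 1), giving a 10-row test set and a 30-row training set. With other group sizes the rounding step decides how many rows each group gives up, so the proportions would only be approximately preserved.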