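I'm still fairly new to R, and I have a question. I have a dataset (1593 observations) that includes a character variable holding various strings and a factor variable with two levels (0 and 1) that labels each string. For classification, I want to use 75% of this dataset as the test sample and 25% as the training sample, while keeping the proportion of 0s the same in both the test and training samples. Is there a way to do this?
Here is the structure of my dataset: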
'data.frame': 1593 obs. of 6 variables:
 $ match_id: int 0 0 0 0 0 0 0 0 0 0 ...
 $ Binary  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 2 ...
 $ key     : chr "force it" "space created" "hah" "ez 500" ...
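Note: I'm actually following the code in Brett Lantz's book "Machine Learning with R" and applying it to my own dataset. This is the part of the book that I would like to reproduce with my data: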
To confirm that the subsets are representative of the complete set of SMS data, let's compare the proportion of spam in the training and test data frames:
> prop.table(table(sms_raw_train$type))
      ham      spam
0.8647158 0.1352842
> prop.table(table(sms_raw_test$type))
      ham      spam
0.8683453 0.1316547
Both the training data and test data contain about 13 percent spam. This suggests that the spam messages were divided evenly between the two datasets.
Thanks for any help.
Answer:
The createDataPartition() function from the caret package is commonly used for this purpose, for example:
library(caret)
set.seed(300)
trainIndex <- createDataPartition(iris$Species, p = .75, list = FALSE, times = 1)
irisTrain <- iris[ trainIndex,]
irisTest <- iris[-trainIndex,]
str(irisTrain)
>'data.frame': 114 obs. of 5 variables:
> $ Sepal.Length: num 5.1 4.9 4.7 5 5.4 4.6 5 4.4 5.4 4.8 ...
> $ Sepal.Width : num 3.5 3 3.2 3.6 3.9 3.4 3.4 2.9 3.7 3.4 ...
> $ Petal.Length: num 1.4 1.4 1.3 1.4 1.7 1.4 1.5 1.4 1.5 1.6 ...
> $ Petal.Width : num 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.2 0.2 ...
> $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
str(irisTest)
>'data.frame': 36 obs. of 5 variables:
> $ Sepal.Length: num 4.6 4.9 5.1 5.1 4.6 4.8 5.2 5.5 5.5 5.1 ...
> $ Sepal.Width : num 3.1 3.1 3.5 3.8 3.6 3.1 4.1 4.2 3.5 3.8 ...
> $ Petal.Length: num 1.5 1.5 1.4 1.5 1 1.6 1.5 1.4 1.3 1.9 ...
> $ Petal.Width : num 0.2 0.1 0.3 0.3 0.2 0.2 0.1 0.2 0.2 0.4 ...
> $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
prop.table(table(irisTrain$Species))
>     setosa versicolor  virginica
>  0.3333333  0.3333333  0.3333333
prop.table(table(irisTest$Species))
>     setosa versicolor  virginica
>  0.3333333  0.3333333  0.3333333
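This gives a pseudo-random stratified sample that splits the data into training and test sets while preserving the class proportions in each, and it is the approach I use in my own work.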
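To make the idea of stratified sampling concrete for a two-level factor like the asker's Binary column, here is a minimal base R sketch that does not depend on caret. It is only an illustration under stated assumptions: the data frame df is synthetic (the real data is not available), the 20% share of 1s is made up, and stratified_index() is a hypothetical helper, not a function from caret or from the book. It samples the same fraction of rows within each class, which is roughly what a stratified split does for a factor outcome.

```r
# Synthetic stand-in for the asker's data (assumption: the real data is not shown in full).
set.seed(300)
n <- 1593
df <- data.frame(
  match_id = seq_len(n),
  Binary   = factor(rbinom(n, size = 1, prob = 0.2), levels = c(0, 1)),
  key      = paste("msg", seq_len(n)),
  stringsAsFactors = FALSE
)

# Hypothetical helper: pick a fraction p of the row indices within each class of y.
stratified_index <- function(y, p) {
  idx_by_class <- split(seq_along(y), y)
  picked <- lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = round(p * length(idx)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

# p is the share of rows that goes to the training set; adjust as needed.
train_idx <- stratified_index(df$Binary, p = 0.75)
df_train  <- df[train_idx, ]
df_test   <- df[-train_idx, ]

# The class proportions should be nearly identical in both subsets.
print(prop.table(table(df_train$Binary)))
print(prop.table(table(df_test$Binary)))
```

With real data, df$Binary would simply be the existing factor column, and the same two prop.table() calls reproduce the check quoted from the book. Small differences between the two subsets come only from rounding the per-class sample sizes.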