Is there a way to split a dataset so that a categorical value keeps the same proportion in each part?

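I'm not very experienced with R yet, but I have a question. I have a dataset (1593 observations) that includes a character variable holding various strings and a factor variable with two levels (0 and 1), one value for each string. For classification, I want to use 75% of this dataset as the test sample and 25% as the training sample, while keeping the proportion of 0s the same in both the test and training samples. Is there a way to do this?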

This is the structure of my dataset:

'data.frame':    1593 obs. of  6 variables:
 $ match_id: int  0 0 0 0 0 0 0 0 0 0 ...
 $ Binary  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 2 ...
 $ key     : chr  "force it" "space created" "hah" "ez 500" ...

Note: I'm following the code from Brett Lantz's book "Machine Learning with R" and applying it to my own dataset. This is the part of the book that I'd like to reproduce with my data:

To confirm that the subsets are representative of the complete set of SMS data, let's compare the proportion of spam in the training and test data frames:

> prop.table(table(sms_raw_train$type))
      ham      spam
0.8647158 0.1352842
> prop.table(table(sms_raw_test$type))
      ham      spam
0.8683453 0.1316547

Both the training data and test data contain about 13 percent spam. This suggests that the spam messages were divided evenly between the two datasets.

Thanks for any help.

Answer:

The createDataPartition() function from the caret package is commonly used for this purpose, for example:

library(caret)
set.seed(300)
trainIndex <- createDataPartition(iris$Species, p = .75,
                                  list = FALSE,
                                  times = 1)
irisTrain <- iris[ trainIndex,]
irisTest  <- iris[-trainIndex,]

str(irisTrain)
>'data.frame':  114 obs. of  5 variables:
> $ Sepal.Length: num  5.1 4.9 4.7 5 5.4 4.6 5 4.4 5.4 4.8 ...
> $ Sepal.Width : num  3.5 3 3.2 3.6 3.9 3.4 3.4 2.9 3.7 3.4 ...
> $ Petal.Length: num  1.4 1.4 1.3 1.4 1.7 1.4 1.5 1.4 1.5 1.6 ...
> $ Petal.Width : num  0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.2 0.2 ...
> $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

str(irisTest)
>'data.frame':  36 obs. of  5 variables:
> $ Sepal.Length: num  4.6 4.9 5.1 5.1 4.6 4.8 5.2 5.5 5.5 5.1 ...
> $ Sepal.Width : num  3.1 3.1 3.5 3.8 3.6 3.1 4.1 4.2 3.5 3.8 ...
> $ Petal.Length: num  1.5 1.5 1.4 1.5 1 1.6 1.5 1.4 1.3 1.9 ...
> $ Petal.Width : num  0.2 0.1 0.3 0.3 0.2 0.2 0.1 0.2 0.2 0.4 ...
> $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

prop.table(table(irisTrain$Species))
>    setosa versicolor  virginica
> 0.3333333  0.3333333  0.3333333

prop.table(table(irisTest$Species))
>    setosa versicolor  virginica
> 0.3333333  0.3333333  0.3333333

This gives a pseudo-random stratified sample that splits the data into training and test groups, and it is what I use in my own work.
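For readers who want to see the idea without installing caret, here is a minimal base-R sketch of a stratified split applied to a data frame shaped like the one in the question. It is not caret's implementation. The data frame `df` is a made-up stand-in (the real data is not available), the 85/15 class balance is an assumption chosen only for illustration, and `stratified_index()` is a hypothetical helper name, not a function from any package.

```r
set.seed(300)

# Hypothetical stand-in for the asker's data: 1593 rows and a
# two-level factor named Binary. The 85/15 balance is an assumption.
df <- data.frame(
  match_id = seq_len(1593),
  Binary   = factor(sample(c("0", "1"), 1593, replace = TRUE,
                           prob = c(0.85, 0.15)))
)

# Hypothetical helper: sample a share p of the row indices
# separately within each level of y, then combine the results.
stratified_index <- function(y, p) {
  idx_by_class <- split(seq_along(y), y)
  picked <- lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = ceiling(p * length(idx)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

train_idx <- stratified_index(df$Binary, p = 0.75)
df_train  <- df[train_idx, ]
df_test   <- df[-train_idx, ]

# Same check as in the book: class proportions in each subset.
prop.table(table(df_train$Binary))
prop.table(table(df_test$Binary))
```

Because the sampling is done within each level, the two proportion tables should agree up to rounding. The sketch follows the answer in putting 75% of the rows in the training set; to get the 75% test / 25% training split described in the question, pass `p = 0.25` or swap the two subsets.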
