Split a dataset into training and test sets while preserving the class distribution [duplicate]

I want to run a machine learning algorithm 10 times on a given dataset, which has the following distribution:

    np.unique(x[:,24], return_counts=True)
    (array([1., 2.]), array([700, 300]))

This means that 70% of my data belongs to class 1 and 30% to class 2.
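As a quick check of that reading, the counts returned by `np.unique` can be turned into proportions directly. This is only a sketch: the `labels` array below is a hypothetical stand-in built to match the counts shown above, not the real label column.

```python
import numpy as np

# Hypothetical label column with the same counts as in the question
# (700 rows of class 1, 300 rows of class 2).
labels = np.array([1.0] * 700 + [2.0] * 300)

# Unique class values and how often each one occurs
classes, counts = np.unique(labels, return_counts=True)

# Share of each class in the whole dataset
proportions = counts / counts.sum()

for c, p in zip(classes, proportions):
    print(f"class {c:.0f}: {p:.0%}")
```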

Below is a snapshot of my data. The last column holds the class label (1 or 2):

    1,6,4,12,5,5,3,4,1,67,3,2,1,2,1,0,0,1,0,0,1,0,0,1,1
    2,48,2,60,1,3,2,2,1,22,3,1,1,1,1,0,0,1,0,0,1,0,0,1,2
    4,12,4,21,1,4,3,3,1,49,3,1,2,1,1,0,0,1,0,0,1,0,1,0,1
    1,42,2,79,1,4,3,4,2,45,3,1,2,1,1,0,0,0,0,0,0,0,0,1,1
    1,24,3,49,1,3,3,4,4,53,3,2,2,1,1,1,0,1,0,0,0,0,0,1,2
    4,36,2,91,5,3,3,4,4,35,3,1,2,2,1,0,0,1,0,0,0,0,1,0,1
    4,24,2,28,3,5,3,4,2,53,3,1,1,1,1,0,0,1,0,0,1,0,0,1,1
    2,36,2,69,1,3,3,2,3,35,3,1,1,2,1,0,1,1,0,1,0,0,0,0,1
    4,12,2,31,4,4,1,4,1,61,3,1,1,1,1,0,0,1,0,0,1,0,1,0,1
    2,30,4,52,1,1,4,2,3,28,3,2,1,1,1,1,0,1,0,0,1,0,0,0,2
    2,12,2,13,1,2,2,1,3,25,3,1,1,1,1,1,0,1,0,1,0,0,0,1,2
    1,48,2,43,1,2,2,4,2,24,3,1,1,1,1,0,0,1,0,1,0,0,0,1,2
    2,12,2,16,1,3,2,1,3,22,3,1,1,2,1,0,0,1,0,0,1,0,0,1,1
    1,24,4,12,1,5,3,4,3,60,3,2,1,1,1,1,0,1,0,0,1,0,1,0,2
    1,15,2,14,1,3,2,4,3,28,3,1,1,1,1,1,0,1,0,1,0,0,0,1,1
    1,24,2,13,2,3,2,2,3,32,3,1,1,1,1,0,0,1,0,0,1,0,1,0,2
    4,24,4,24,5,5,3,4,2,53,3,2,1,1,1,0,0,1,0,0,1,0,0,1,1
    1,30,0,81,5,2,3,3,3,25,1,3,1,1,1,0,0,1,0,0,1,0,0,1,1
    2,24,2,126,1,5,2,2,4,44,3,1,1,2,1,0,1,1,0,0,0,0,0,0,2
    4,24,2,34,3,5,3,2,3,31,3,1,2,2,1,0,0,1,0,0,1,0,0,1,1
    4,9,4,21,1,3,3,4,3,48,3,3,1,2,1,1,0,1,0,0,1,0,0,1,1
    1,6,2,26,3,3,3,3,1,44,3,1,2,1,1,0,0,1,0,1,0,0,0,1,1
    1,10,4,22,1,2,3,3,1,48,3,2,2,1,2,1,0,1,0,1,0,0,1,0,1
    2,12,4,18,2,2,3,4,2,44,3,1,1,1,1,0,1,1,0,0,1,0,0,1,1
    4,10,4,21,5,3,4,1,3,26,3,2,1,1,2,0,0,1,0,0,1,0,0,1,1
    1,6,2,14,1,3,3,2,1,36,1,1,1,2,1,0,0,1,0,0,1,0,1,0,1
    4,6,0,4,1,5,4,4,3,39,3,1,1,1,1,0,0,1,0,0,1,0,1,0,1
    3,12,1,4,4,3,2,3,1,42,3,2,1,1,1,0,0,1,0,1,0,0,0,1,1
    2,7,2,24,1,3,3,2,1,34,3,1,1,1,1,0,0,0,0,0,1,0,0,1,1
    1,60,3,68,1,5,3,4,4,63,3,2,1,2,1,0,0,1,0,0,1,0,0,1,2
    2,18,2,19,4,2,4,3,1,36,1,1,1,2,1,0,0,1,0,0,1,0,0,1,1
    1,24,2,40,1,3,3,2,3,27,2,1,1,1,1,0,0,1,0,0,1,0,0,1,1
    2,18,2,59,2,3,3,2,3,30,3,2,1,2,1,1,0,1,0,0,1,0,0,1,1
    4,12,4,13,5,5,3,4,4,57,3,1,1,1,1,0,0,1,0,1,0,0,1,0,1
    3,12,2,15,1,2,2,1,2,33,1,1,1,2,1,0,0,1,0,0,1,0,0,0,1
    2,45,4,47,1,2,3,2,2,25,3,2,1,1,1,0,0,1,0,0,1,0,1,0,2
    4,48,4,61,1,3,3,3,4,31,1,1,1,2,1,0,0,1,0,0,0,0,0,1,1

The complete dataset can be found here

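I want to split the data into 90% for training and 10% for testing. However, every split must preserve the class proportions (i.e., in both the training and validation parts, 70% of the data must come from class 1 and 30% from class 2).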

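I know how to split the data into training and test sets in the plain way, but I don't know how to make the split respect the class distribution described above. How can I do this in Python?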


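Answer:

You can use RepeatedStratifiedKFold, which, as the name suggests, repeats stratified K-fold cross-validation n times. To repeat the procedure 10 times, set n_repeats=10; to get a train/test size ratio of roughly 9:1, set n_splits=10. Every fold keeps the class proportions of y as closely as the sample size allows, and the two settings together yield 10 × 10 = 100 train/test pairs.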

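    from sklearn.model_selection import RepeatedStratifiedKFold

    # `a` is the data as a 2-D NumPy array with the class label in the last column
    X = a[:,:-1]   # all feature columns
    y = a[:,-1]    # class labels (1 or 2)

    # 10 folds (about 9:1 train/test), repeated 10 times with different shuffles
    rskf = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=2)

    for train_index, test_index in rskf.split(X, y):
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
        # Share of class 1 in the training part of this split
        print(f'\nClass 1: {((y_train==1).sum()/len(y_train))*100:.0f}%')
        print(f'\nShape of train: {X_train.shape[0]}')
        print(f'Shape of test: {X_test.shape[0]}')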

    Class 1: 73%

    Shape of train: 33
    Shape of test: 4

    Class 1: 73%

    Shape of train: 33
    Shape of test: 4

    Class 1: 73%

    Shape of train: 33
    Shape of test: 4

    Class 1: 73%

    Shape of train: 33
    Shape of test: 4
    ...
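If what is needed is literally ten independent 90/10 splits rather than repeated 10-fold cross-validation, scikit-learn's StratifiedShuffleSplit may be a closer fit. The following is a minimal sketch, not the answerer's method: `X` and `y` are a synthetic stand-in with the 700/300 class counts from the question, and the variable names and random seeds are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic stand-in for the real data (assumption): 1000 rows,
# 24 feature columns, 700 labels of class 1 and 300 of class 2.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 24))
y = np.array([1] * 700 + [2] * 300)

# 10 independent random splits, each holding out 10% of the rows for testing
sss = StratifiedShuffleSplit(n_splits=10, test_size=0.1, random_state=0)

splits = []
for train_index, test_index in sss.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    splits.append((train_index, test_index))
    # Report the size and class-1 share of both parts of the split
    print(
        f"train: {len(y_train)} rows, class 1 share {np.mean(y_train == 1):.2f} | "
        f"test: {len(y_test)} rows, class 1 share {np.mean(y_test == 1):.2f}"
    )
```

Unlike K-fold, the test sets drawn this way may overlap between iterations, which is usually acceptable when the runs are meant to be independent repetitions. For a single split, `train_test_split(X, y, test_size=0.1, stratify=y)` applies the same stratification.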
