Stratified cross-validation or sampling for a train-test split based on multiple features in Python

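sklearn's train_test_split, StratifiedShuffleSplit, and StratifiedKFold all stratify on the class labels (the y variable or target column). What if we want to sample based on feature columns (x variables) instead of the target column? With a single feature, stratifying on that one column would be easy, but what if there are several feature columns and we want to preserve the population proportions in the selected sample?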

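Below I create a df with a skewed population: more low-income people, more women, the fewest people from CA, and the most from MA. I want the selected sample to share these characteristics, i.e. more low-income people, more women, the fewest from CA, and the most from MA.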

    import random
    import string
    import pandas as pd

    N = 20000  # total number of rows in the data

    names     = [''.join(random.choices(string.ascii_uppercase, k=5)) for _ in range(N)]
    incomes   = [random.choices(['High', 'Low'], weights=(30, 70))[0] for _ in range(N)]
    genders   = [random.choices(['M', 'F'], weights=(40, 60))[0] for _ in range(N)]
    states    = [random.choices(['CA', 'IL', 'FL', 'MA'], weights=(10, 20, 30, 40))[0] for _ in range(N)]
    targets_y = [random.choice([0, 1]) for _ in range(N)]

    df = pd.DataFrame(dict(
        name     = names,
        income   = incomes,
        gender   = genders,
        state    = states,
        target_y = targets_y
    ))

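Another complication arises when some feature values have very few examples and we want the selected sample to include at least n of them. Consider this example: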

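    single_row = {'name': 'ABC', 'income': 'High', 'gender': 'F', 'state': 'NY', 'target_y': 1}
    # DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
    df = pd.concat([df, pd.DataFrame([single_row])], ignore_index=True)
    df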

I want this single added row to always be included in the test split (here n=1).

Answer:

This can be done with pandas groupby:

Let's first check the characteristics of the population:

    grps = df.groupby(['state', 'income', 'gender'], group_keys=False)
    grps.count()

Next, let's create a test set containing 20% of the original data:

    test_proportion = 0.2
    at_least = 1
    # From each group take 20% of its rows, but never fewer than at_least
    test = grps.apply(lambda x: x.sample(max(round(len(x) * test_proportion), at_least)))
    test
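The sampling step above can also be sketched as an explicit loop, which makes the per-group sample size easier to see. The helper name `stratified_test_sample`, its `seed` parameter, and the tiny `demo` frame are assumptions made for illustration and are not part of the original answer. The sketch also caps the sample at the group size, so a group with fewer than `at_least` rows does not raise an error.

```python
import pandas as pd

def stratified_test_sample(df, cols, test_proportion=0.2, at_least=1, seed=0):
    """Hypothetical helper: sample a test set group by group."""
    parts = []
    for _, group in df.groupby(cols):
        # Proportional share, raised to at_least, capped at the group size
        n = min(len(group), max(round(len(group) * test_proportion), at_least))
        parts.append(group.sample(n=n, random_state=seed))
    return pd.concat(parts)

# Tiny illustrative frame: one group of ten rows and one single-row group
demo = pd.DataFrame({
    "state":  ["MA"] * 10 + ["NY"],
    "gender": ["F"] * 11,
    "value":  list(range(11)),
})

test = stratified_test_sample(demo, ["state", "gender"])
train = demo.drop(test.index)  # everything not sampled into the test set
print(len(test), len(train))   # 3 8
```

Here the MA group contributes round(10 * 0.2) = 2 rows, and the single NY row is always selected because `at_least` is 1.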

Characteristics of the test set:

    test.groupby(['state', 'income', 'gender']).count()
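Raw counts are hard to compare by eye, so one option is to put each group's share of the population next to its share of the test set. The following is a minimal sketch on a smaller synthetic frame. The frame, the seed, and the `comparison` table are assumptions made for illustration, and the sampling is written as a loop rather than with `apply`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
N = 2000
df = pd.DataFrame({
    "income": rng.choice(["High", "Low"], size=N, p=[0.3, 0.7]),
    "gender": rng.choice(["M", "F"], size=N, p=[0.4, 0.6]),
})
cols = ["income", "gender"]

# Sample 20% of every group, with at least one row per group
test = pd.concat([
    g.sample(n=max(round(len(g) * 0.2), 1), random_state=0)
    for _, g in df.groupby(cols)
])

# Share of each group in the population and in the test set
comparison = pd.DataFrame({
    "population": df.groupby(cols).size() / len(df),
    "test": test.groupby(cols).size() / len(test),
})
comparison["abs_diff"] = (comparison["population"] - comparison["test"]).abs()
print(comparison)
```

If the stratification worked, the `abs_diff` column should be close to zero for every group. Very small groups can deviate more, because the `at_least` floor deliberately over-represents them.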

Next, we create the training set as the difference between the original df and the test set:

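    print('Number of samples in test =', len(test))
    # Drop the sampled rows by index: the random names can repeat,
    # so a set difference on the name column could miscount
    train = df.drop(test.index)
    print('Number of samples in train =', len(train))

Number of samples in test = 4000

Number of samples in train = 16001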
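The answer covers a single train-test split. For the cross-validation case mentioned in the title, one possible extension is to deal the rows of each group round-robin into k folds. This is only a sketch under stated assumptions: the `fold` column, the value of `k`, and the synthetic frame are illustrative and do not come from the original answer.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
N = 1000
df = pd.DataFrame({
    "income": rng.choice(["High", "Low"], size=N, p=[0.3, 0.7]),
    "state": rng.choice(["CA", "IL", "FL", "MA"], size=N, p=[0.1, 0.2, 0.3, 0.4]),
})
cols = ["income", "state"]
k = 5

# Shuffle once, then number the rows inside each group 0, 1, 2, ...
shuffled = df.sample(frac=1, random_state=0)
position = shuffled.groupby(cols).cumcount()
# Dealing rows round-robin gives every fold about 1/k of each group
shuffled["fold"] = position % k

for fold in range(k):
    test = shuffled[shuffled["fold"] == fold]
    train = shuffled[shuffled["fold"] != fold]
    print(fold, len(train), len(test))
```

Within each group the fold sizes differ by at most one row. A group with fewer than k rows appears in the test part of only some folds, so the "at least n" requirement from the question would need separate handling in this scheme.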
