### Sampling a DataFrame in Python like pandas DataFrame.sample(), but always choosing n adjacent rows

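I want to split my dataframe into a training set and a test set, but the test set should consist of groups of, for example, 3 adjacent rows, taken several times from across the whole data. I am not sure how to phrase this properly, so please look at the tables below. In short, I want to split my dataframe in chunks.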

All data:

Y	row_num	x1	x2
value	1	some value	some other value
value	2	some value	some other value
value	3	some value	some other value
value	4	some value	some other value
value	5	some value	some other value
value	6	some value	some other value
value	7	some value	some other value
value	8	some value	some other value
value	9	some value	some other value
value	10	some value	some other value
value	11	some value	some other value

What I want:

Training set:

Y	row_num	x1	x2
value	1	some value	some other value
value	5	some value	some other value
value	6	some value	some other value
value	10	some value	some other value
value	11	some value	some other value

Test set:

Y	row_num	x1	x2
value	2	some value	some other value
value	3	some value	some other value
value	4	some value	some other value
value	7	some value	some other value
value	8	some value	some other value
value	9	some value	some other value

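Answer:

There may be a more elegant and/or more efficient way to achieve your goal. I have not yet come up with a solution that randomly picks a fixed number of blocks of n consecutive elements from a list (without duplicates).

I would probably start with something like this: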


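import random

import numpy as np


def custom_split(df, train_size, n_adjacent=3):
    # Number of blocks of n_adjacent consecutive rows needed for the test set.
    # round() guards against floating-point error, e.g. 20 * (1 - 0.8) = 3.999...
    test_size = int(round(len(df) * (1 - train_size), 8) // n_adjacent)
    n_attempt = 10
    while n_attempt > 0:
        retry = False
        available_idx = list(range(len(df)))
        test_idx = []
        for _ in range(test_size):
            # Positions that can still start a full block. Filtering by value
            # (rather than slicing off the tail) also works when n_adjacent is 1.
            candidates = [idx for idx in available_idx if idx + n_adjacent <= len(df)]
            # If no block of consecutive positions is left, start over.
            if not candidates:
                retry = True
                n_attempt -= 1
                break
            # Pick a starting position among the candidates.
            add_idx = random.choice(candidates)
            # Extend it to the n_adjacent - 1 positions that follow.
            new_idx = list(range(add_idx, add_idx + n_adjacent))
            # Remove these positions from the available list, along with any
            # position that can no longer start a block of n_adjacent free rows.
            available_idx = [idx for idx in available_idx if idx not in new_idx
                             and idx + n_adjacent - 1 not in new_idx]
            test_idx.extend(new_idx)
        if not retry:
            # Success: mark the test positions as False in the train mask.
            train_idx = np.ones(len(df), dtype=bool)
            train_idx[test_idx] = False
            return df.iloc[train_idx, :], df.iloc[test_idx, :]
    # Raise an exception after 10 failed attempts.
    raise RuntimeError("Could not find consecutive indexes to select at random.")


# 80% train, 20% test; the train part is rounded up.
# Thanks to the mask, every row of the dataframe ends up in one of the two sets.
# a_dataframe stands for your own DataFrame.
train_set, test_set = custom_split(a_dataframe, train_size=0.8, n_adjacent=5)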

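The main issue with this solution is that there may be no run of consecutive indexes left by the time random.choice is called. That is what the while loop is for: whenever an attempt fails, it starts over, up to 10 times, and raises an exception after that.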
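If the retry loop is a concern, one alternative is to cut the frame into fixed, non-overlapping blocks and sample whole blocks instead. This is a minimal sketch of that different technique, not the method used in the answer; `block_split` and its `seed` parameter are hypothetical names introduced here for illustration.

```python
import numpy as np
import pandas as pd


def block_split(df, train_size, n_adjacent=3, seed=None):
    """Split df so the test set is made of whole blocks of n_adjacent rows.

    Hypothetical helper, not part of pandas. Blocks start at positions
    0, n_adjacent, 2 * n_adjacent, ... so they can never overlap.
    """
    rng = np.random.default_rng(seed)
    # Number of full blocks; a trailing partial block always stays in train.
    n_blocks = len(df) // n_adjacent
    # Number of blocks to draw (round() absorbs floating-point error).
    n_test_blocks = int(round(len(df) * (1 - train_size), 8) // n_adjacent)
    n_test_blocks = min(n_test_blocks, n_blocks)
    # Draw distinct block numbers, then expand each into its row positions.
    chosen = rng.choice(n_blocks, size=n_test_blocks, replace=False)
    if n_test_blocks:
        test_pos = np.concatenate(
            [np.arange(b * n_adjacent, (b + 1) * n_adjacent) for b in chosen]
        )
    else:
        test_pos = np.array([], dtype=int)
    train_mask = np.ones(len(df), dtype=bool)
    train_mask[test_pos] = False
    return df.iloc[train_mask], df.iloc[test_pos]


# Example frame assumed for illustration: 20 rows numbered 1 to 20.
df = pd.DataFrame({"row_num": range(1, 21)})
train, test = block_split(df, train_size=0.7, n_adjacent=3, seed=0)
```

The trade-off is that blocks can only start at multiples of n_adjacent, so this cannot produce every split the retry-based version can (for instance, a test block covering rows 2 to 4 as in the question), but it never runs out of room.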

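Here "idx" is not the value of the dataframe's index column but the position of the row along its axis. That is why I use iloc rather than loc.

Result with a 20-row dataframe, a train_size of 70% and an n_adjacent of 3: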
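To illustrate the difference, here is a small sketch with a made-up frame whose index labels do not match the row positions (the column name and values are assumptions for the example):

```python
import pandas as pd

# Hypothetical frame whose index labels differ from the row positions.
df = pd.DataFrame({"x1": [10, 20, 30, 40]}, index=[100, 101, 102, 103])

by_position = df.iloc[[1, 2]]   # second and third rows, whatever their labels
by_label = df.loc[[101, 102]]   # rows whose index labels are 101 and 102

# df.loc[[1, 2]] would raise a KeyError here, because no row is labelled 1 or 2.
```

Both selections return the same two rows, but only the positional one keeps working if the index is later reset or reordered.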


# IDX
# train:[0, 1, 2, 3, 4, 5, 6, 7, 8, 12, 13, 14, 15, 16]
# test:[17, 18, 19, 9, 10, 11]
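A result like this can be sanity-checked with a few lines. The sketch below uses the two position lists printed above; `is_made_of_blocks` is a hypothetical helper written for this check, not something from the answer.

```python
def is_made_of_blocks(positions, n_adjacent):
    """Return True if positions is a sequence of runs of n_adjacent consecutive values."""
    positions = list(positions)
    if len(positions) % n_adjacent != 0:
        return False
    for i in range(0, len(positions), n_adjacent):
        block = positions[i:i + n_adjacent]
        if block != list(range(block[0], block[0] + n_adjacent)):
            return False
    return True


train_idx = [0, 1, 2, 3, 4, 5, 6, 7, 8, 12, 13, 14, 15, 16]
test_idx = [17, 18, 19, 9, 10, 11]

blocks_ok = is_made_of_blocks(test_idx, 3)                    # test rows come in runs of 3
no_overlap = set(train_idx).isdisjoint(test_idx)              # no row is in both sets
full_cover = sorted(train_idx + test_idx) == list(range(20))  # every row is used once
print(blocks_ok, no_overlap, full_cover)  # prints "True True True"
```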

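Depending on your needs, do not forget to shuffle the training set, or both sets, afterwards. Here is an elegant way to shuffle the rows of a dataframe: https://stackoverflow.com/a/34879805/10409093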
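One common way to shuffle rows in pandas is to sample the whole frame without replacement. This is a minimal sketch with made-up data; the frame contents and the `random_state` value are assumptions for the example.

```python
import pandas as pd

# Stand-in for the training set returned by the split.
train_set = pd.DataFrame({"row_num": [1, 5, 6, 10, 11]})

# frac=1 draws every row exactly once, in random order;
# reset_index(drop=True) discards the old index so it runs 0..n-1 again.
shuffled = train_set.sample(frac=1, random_state=42).reset_index(drop=True)
```

Note that shuffling the test set scatters the rows of each adjacent block, so only do that if the order inside the test set does not matter for your use.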
