Sliding-window train/test split for time series data

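I have a series of 36 data points and want to train and test on it with a sliding window. I looked at TimeSeriesSplit(), but it only produces expanding training sets like the following: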

('TRAIN:', array([0, 1, 2]), 'TEST:', array([3, 4, 5]))
('TRAIN:', array([0, 1, 2, 3, 4, 5]), 'TEST:', array([6, 7, 8]))
('TRAIN:', array([0, 1, 2, 3, 4, 5, 6, 7, 8]), 'TEST:', array([ 9, 10, 11]))
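For reference, a minimal sketch that reproduces splits like the ones above. The 12-point series and n_splits=3 are assumptions inferred from the output shown; the question does not state them.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Assumption: a 12-point placeholder series; the question does not say
# which data produced the output above.
X = np.arange(12)

tscv = TimeSeriesSplit(n_splits=3)
splits = [(train.tolist(), test.tolist()) for train, test in tscv.split(X)]

# Each training set starts at index 0 and grows, so the window expands
# instead of sliding.
for train, test in splits:
    print("TRAIN:", train, "TEST:", test)
```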

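I want a training window of fixed length 12 that slides forward one point at a time, with a test window of fixed length 3 that slides along with it. For example: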

('TRAIN:', array([0,1,2,3,4,5,6,7,8,9,10,11]),  'TEST:', array([12,13,14]))
('TRAIN:', array([1,2,3,4,5,6,7,8,9,10,11,12]),  'TEST:', array([13,14,15]))
('TRAIN:', array([2,3,4,5,6,7,8,9,10,11,12,13]),  'TEST:', array([14,15,16]))
...

I read this article (https://ntguardian.wordpress.com/2017/06/19/walk-forward-analysis-demonstration-backtrader/) and tried the following code:

from sklearn.model_selection import TimeSeriesSplit
from sklearn.utils import indexable
from sklearn.utils.validation import _num_samples
import numpy as np


class TimeSeriesSplitImproved(TimeSeriesSplit):
    def split(self, X, y=None, groups=None, fixed_length=False,
              train_splits=1, test_splits=1):
        X, y, groups = indexable(X, y, groups)
        n_samples = _num_samples(X)
        n_splits = self.n_splits
        n_folds = n_splits + 1
        train_splits, test_splits = int(train_splits), int(test_splits)
        if n_folds > n_samples:
            raise ValueError(
                ("Cannot have number of folds ={0} greater"
                 " than the number of samples: {1}.").format(n_folds,
                                                             n_samples))
        if (n_folds - train_splits - test_splits) <= 0 and test_splits > 0:
            raise ValueError(
                ("Both train_splits and test_splits must be positive"
                 " integers."))
        indices = np.arange(n_samples)
        # All sizes are multiples of split_size, including the step.
        split_size = (n_samples // n_folds)
        test_size = split_size * test_splits
        train_size = split_size * train_splits
        test_starts = range(train_size + n_samples % n_folds,
                            n_samples - (test_size - split_size),
                            split_size)
        if fixed_length:
            # Fixed-length training window ending at each test start.
            for i, test_start in zip(range(len(test_starts)),
                                     test_starts):
                rem = 0
                if i == 0:
                    rem = n_samples % n_folds
                yield (indices[(test_start - train_size - rem):test_start],
                       indices[test_start:test_start + test_size])
        else:
            # Expanding training window starting at index 0.
            for test_start in test_starts:
                yield (indices[:test_start],
                       indices[test_start:test_start + test_size])


X = np.arange(36)  # placeholder values standing in for the 36-point series

model = TimeSeriesSplitImproved(n_splits=5)
for train_index, test_index in model.split(X, fixed_length=True,
                                           train_splits=2, test_splits=1):
    print("TRAIN:", train_index, "TEST:", test_index)
    train, test = X[train_index], X[test_index]

But I only got the following result:

TRAIN: [ 0  1  2  3  4  5  6  7  8  9 10 11] TEST: [12 13 14 15 16 17]
TRAIN: [ 6  7  8  9 10 11 12 13 14 15 16 17] TEST: [18 19 20 21 22 23]
TRAIN: [12 13 14 15 16 17 18 19 20 21 22 23] TEST: [24 25 26 27 28 29]
TRAIN: [18 19 20 21 22 23 24 25 26 27 28 29] TEST: [30 31 32 33 34 35]
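The step of 6 and the test length of 6 in this output appear to follow from how split() derives every size from n_splits. A short sketch of that arithmetic is below; window_params is a hypothetical helper written for illustration that only mirrors the size calculations in the code above, and it is not part of scikit-learn or the linked article.

```python
def window_params(n_samples, n_splits, train_splits, test_splits):
    """Mirror the size arithmetic in TimeSeriesSplitImproved.split()."""
    n_folds = n_splits + 1
    split_size = n_samples // n_folds
    train_size = split_size * train_splits
    test_size = split_size * test_splits
    test_starts = list(range(train_size + n_samples % n_folds,
                             n_samples - (test_size - split_size),
                             split_size))
    return split_size, train_size, test_size, test_starts


# The call from the question: the window advances 6 points at a time.
asked = window_params(36, n_splits=5, train_splits=2, test_splits=1)
print(asked)

# One fold per sample makes split_size 1, so the window advances 1 point.
fine = window_params(36, n_splits=35, train_splits=12, test_splits=3)
print(fine[:3], fine[3][0], fine[3][-1])
```

This suggests that the same class, constructed with n_splits=35 and called with train_splits=12, test_splits=3 and fixed_length=True, should give a 12-point training window and a 3-point test window that move one point at a time. Treat that as an inference from the arithmetic rather than a tested result.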

Thanks in advance for your help!


Answer:

Since your dataset has only 36 points, you can easily do this by hand. The following example should help:

import numpy as np

data = list(range(36))
window_size = 12  # length of the training window
test_size = 3     # length of the test window

splits = []
# Stop early enough that every test window has exactly test_size points.
for i in range(window_size, len(data) - test_size + 1):
    train = np.array(data[i - window_size:i])
    test = np.array(data[i:i + test_size])
    splits.append(('TRAIN:', train, 'TEST:', test))

# Inspect the result
for a_tuple in splits:
    print(a_tuple)

# First three lines of output:
# ('TRAIN:', array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11]), 'TEST:', array([12, 13, 14]))
# ('TRAIN:', array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12]), 'TEST:', array([13, 14, 15]))
# ('TRAIN:', array([ 2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13]), 'TEST:', array([14, 15, 16]))
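The same idea can be wrapped in a reusable generator that yields index arrays, so it works for series of other lengths. This is a minimal sketch, not a library API: sliding_window_splits and its parameters (train_size, test_size, step) are names chosen here for illustration, and it stops as soon as a full test window no longer fits.

```python
import numpy as np


def sliding_window_splits(n_samples, train_size=12, test_size=3, step=1):
    """Yield (train_idx, test_idx) index arrays for fixed-length windows."""
    if train_size < 1 or test_size < 1 or step < 1:
        raise ValueError("train_size, test_size and step must be positive")
    indices = np.arange(n_samples)
    # Last valid start leaves room for a full train and a full test window.
    last_start = n_samples - train_size - test_size
    for start in range(0, last_start + 1, step):
        split = start + train_size
        yield indices[start:split], indices[split:split + test_size]


splits = list(sliding_window_splits(36))
for train_idx, test_idx in splits[:3]:
    print("TRAIN:", train_idx, "TEST:", test_idx)
```

Because each item is a pair of train and test index arrays, a list built this way can also be passed as the cv argument of scikit-learn helpers such as cross_val_score, which accept an iterable of index splits.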
