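I am trying to create an LSTM model that gives a binary output: buy or don't buy. My data has the format [date_time, close, volume], with millions of rows. I am having trouble formatting the data as 3-D (samples, timesteps, features).
I have already read the data with pandas. I want to format it as 4000 samples, each with 400 timesteps and two features (close and volume). Can anyone tell me how to do this?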
EDIT: I used TimeseriesGenerator as suggested, but I am not sure how to inspect my sequences and replace the output Y with my own binary buy output.
    df = normalize_data(df)
    print("Creating sequences for the neural network\n")
    targets = df.drop('date_time', 1)
    train = keras.preprocessing.sequence.TimeseriesGenerator(
        df, targets, 1,
        sampling_rate=1, stride=1,
        start_index=0, end_index=int(len(df.index)*0.8),
        shuffle=True, reverse=False, batch_size=time_steps)
This code runs without errors, but the output is now the first close value after the input time series.
EDIT 2: So far my code looks like this:
    df = data.normalize_data(df)
    targets = df.iloc[:, 3]  # buy-signal target
    df.drop('y1', axis=1, inplace=True)
    df.drop('y2', axis=1, inplace=True)
    train = TimeseriesGenerator(
        df, targets, length=1,
        sampling_rate=1, stride=1,
        start_index=0, end_index=int(len(df.index) * 0.8),
        shuffle=True, reverse=False, batch_size=time_steps)
    # number of samples
    print("Number of samples: " + str(len(train)))
    x, y = train[0]
    print(str(x))
Output:
    Number of samples: 8
    Traceback (most recent call last):
      File "/home/stian/.local/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 3078, in get_loc
        return self._engine.get_loc(key)
      File "pandas/_libs/index.pyx", line 140, in pandas._libs.index.IndexEngine.get_loc
      File "pandas/_libs/index.pyx", line 162, in pandas._libs.index.IndexEngine.get_loc
      File "pandas/_libs/hashtable_class_helper.pxi", line 1492, in pandas._libs.hashtable.PyObjectHashTable.get_item
      File "pandas/_libs/hashtable_class_helper.pxi", line 1500, in pandas._libs.hashtable.PyObjectHashTable.get_item
    KeyError: range(418, 419)

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "./main.py", line 94, in <module>
        data_menu()
      File "./main.py", line 42, in data_menu
        data_menu()
      File "./main.py", line 56, in data_menu
        nn_menu()
      File "./main.py", line 76, in nn_menu
        nn.nn_gen(pre_processed_data)
      File "/home/stian/git/stian9k/nn.py", line 33, in nn_gen
        x, y = train[0]
      File "/home/stian/.local/lib/python3.6/site-packages/keras_preprocessing/sequence.py", line 378, in __getitem__
        samples[j] = self.data[indices]
      File "/home/stian/.local/lib/python3.6/site-packages/pandas/core/frame.py", line 2688, in __getitem__
        return self._getitem_column(key)
      File "/home/stian/.local/lib/python3.6/site-packages/pandas/core/frame.py", line 2695, in _getitem_column
        return self._get_item_cache(key)
      File "/home/stian/.local/lib/python3.6/site-packages/pandas/core/generic.py", line 2489, in _get_item_cache
        values = self._data.get(item)
      File "/home/stian/.local/lib/python3.6/site-packages/pandas/core/internals.py", line 4115, in get
        loc = self.items.get_loc(item)
      File "/home/stian/.local/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 3080, in get_loc
        return self._engine.get_loc(self._maybe_cast_indexer(key))
      File "pandas/_libs/index.pyx", line 140, in pandas._libs.index.IndexEngine.get_loc
      File "pandas/_libs/index.pyx", line 162, in pandas._libs.index.IndexEngine.get_loc
      File "pandas/_libs/hashtable_class_helper.pxi", line 1492, in pandas._libs.hashtable.PyObjectHashTable.get_item
      File "pandas/_libs/hashtable_class_helper.pxi", line 1500, in pandas._libs.hashtable.PyObjectHashTable.get_item
    KeyError: range(418, 419)
So although I get 8 objects from the generator, I cannot look them up. If I test the type with print(str(type(train))), I get a TimeseriesGenerator object. Thanks again for any advice.
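EDIT 3: It turns out that TimeseriesGenerator does not like pandas data frames. Converting the data to a numpy array, and converting the pandas timestamp type to float, solved the problem.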
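For readers hitting the same KeyError, here is a minimal sketch of that conversion. The toy data frame and its values are made up for illustration; only the column names date_time, close and volume come from the question. Timestamps are turned into float seconds since the Unix epoch, which is one possible choice among several.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real data (values are invented).
df = pd.DataFrame({
    "date_time": pd.date_range("2019-01-01", periods=6, freq="min"),
    "close": [10.0, 10.5, 10.2, 10.8, 11.0, 10.9],
    "volume": [100, 120, 90, 150, 130, 110],
})

# Convert pandas timestamps to float seconds since the Unix epoch.
epoch = pd.Timestamp("1970-01-01")
df["date_time"] = (df["date_time"] - epoch) / pd.Timedelta(seconds=1)

# Plain float arrays: positional indexing such as features[2:5] now selects
# rows, whereas df[...] on a DataFrame looks up column labels.
features = df[["close", "volume"]].to_numpy(dtype="float64")
timestamps = df["date_time"].to_numpy(dtype="float64")

print(features.shape)   # (rows, features)
print(timestamps[:2])
```

The traceback above fails at `self.data[indices]`, where a DataFrame treats the range as a column key; a NumPy array treats it as a row selection instead.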
Answer:
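You can simply use Keras's TimeseriesGenerator for this purpose. You can easily set the number of timesteps per sample (i.e. length), the sampling rate, and the stride to subsample the data.
TimeseriesGenerator returns an instance of the Sequence class, which you can then pass to fit_generator to fit the model on the data it generates. I strongly recommend reading the documentation for more information about this class, its parameters, and its usage.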
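To make the alignment between windows and targets concrete, here is a rough NumPy-only sketch of the windowing TimeseriesGenerator performs with sampling_rate=1 and stride=1: each window of `length` rows is paired with the target at the row right after it. The helper name make_windows, the toy prices, and the buy rule (label 1 when the close rose relative to the previous row) are all assumptions for illustration, not part of Keras or of the question.

```python
import numpy as np

def make_windows(features, labels, length):
    """Return x of shape (samples, length, n_features) and one label per window.

    Window k covers rows k .. k+length-1 and is paired with labels[k + length].
    """
    features = np.asarray(features, dtype="float64")
    labels = np.asarray(labels)
    n = len(features) - length  # windows that still have a following label
    if n <= 0:
        raise ValueError("need more rows than `length`")
    x = np.stack([features[k:k + length] for k in range(n)])
    y = labels[length:length + n]
    return x, y

# Toy data (invented values): 8 rows, 2 features (close, volume).
close = np.array([1.0, 2.0, 1.5, 3.0, 2.5, 4.0, 3.5, 5.0])
volume = np.arange(8, dtype="float64") * 10
features = np.column_stack([close, volume])

# Hypothetical buy rule: 1 if the close is higher than on the previous row.
labels = np.zeros(len(close), dtype=int)
labels[1:] = (close[1:] > close[:-1]).astype(int)

x, y = make_windows(features, labels, length=3)
print(x.shape)  # (samples, timesteps, features)
print(y)
```

Because the label array is indexed the same way as the data, supplying your own binary buy signal is a matter of building `labels` so that `labels[i]` is the decision for the window ending at row i-1. With TimeseriesGenerator itself, the same idea would mean passing the NumPy features as data, the label array as targets, and length=400. In the question's code length is 1 and batch_size is time_steps, which appears to be why only 8 batches of one-step windows were reported. Note that stacking every overlapping window copies the data, so for millions of rows a batch-wise generator is the more memory-friendly route.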