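I'm new to machine learning for time series and need to build a project. My data is sampled once per minute. Can someone help me create this algorithm?
Dataset: each value represents one minute of collection (9:00, 9:01, …). Each collection lasts 10 minutes and was carried out over two months, i.e. there are 10 values in January and 10 values in February.
Goal: I want the result to be a forecast of the next 10 minutes in March, for example:
2020-03-01 9:00:00
2020-03-01 9:01:00
2020-03-01 9:02:00
2020-03-01 9:03:00
Training: the training must include January and February as the reference for the forecast, given that this is a time series.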
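Seasonality: [figure]
Forecast: [figure]
Current problem: the current forecast seems to fail. The earlier values do not look like a valid time series: as the seasonality figure shows, the dataset appears as a straight line. In the forecast figure, the forecast is the green line and the original data is the blue line. However, the date axis extends to 2020-11-01 when it should extend to 2020-03-01, and the original data forms a rectangle in the plot.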
script.py
# -*- coding: utf-8 -*-
try:
    import pandas as pd
    import numpy as np
    import pmdarima as pm
    #%matplotlib inline
    import matplotlib.pyplot as plt
    from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
    from statsmodels.tsa.arima_model import ARIMA
    from statsmodels.tsa.seasonal import seasonal_decompose
    from dateutil.parser import parse
except ImportError as e:
    print("[FAILED] {}".format(e))

class operationsArima():

    @staticmethod
    def ForecastingWithArima():
        try:
            # Import
            data = pd.read_csv('minute.csv', parse_dates=['date'], index_col='date')

            # Plot
            fig, axes = plt.subplots(2, 1, figsize=(10,5), dpi=100, sharex=True)

            # Usual Differencing
            axes[0].plot(data[:], label='Original Series')
            axes[0].plot(data[:].diff(1), label='Usual Differencing')
            axes[0].set_title('Usual Differencing')
            axes[0].legend(loc='upper left', fontsize=10)
            print("[OK] Generated axes")

            # Seasonal
            axes[1].plot(data[:], label='Original Series')
            axes[1].plot(data[:].diff(11), label='Seasonal Differencing', color='green')
            axes[1].set_title('Seasonal Differencing')
            plt.legend(loc='upper left', fontsize=10)
            plt.suptitle('Drug Sales', fontsize=16)
            plt.show()

            # Seasonal - fit stepwise auto-ARIMA
            smodel = pm.auto_arima(data, start_p=1, start_q=1,
                                   test='adf',
                                   max_p=3, max_q=3, m=11,
                                   start_P=0, seasonal=True,
                                   d=None, D=1, trace=True,
                                   error_action='ignore',
                                   suppress_warnings=True,
                                   stepwise=True)
            smodel.summary()
            print(smodel.summary())
            print("[OK] Generated model")

            # Forecast
            n_periods = 11
            fitted, confint = smodel.predict(n_periods=n_periods, return_conf_int=True)
            index_of_fc = pd.date_range(data.index[-1], periods = n_periods, freq='MS')

            # make series for plotting purpose
            fitted_series = pd.Series(fitted, index=index_of_fc)
            lower_series = pd.Series(confint[:, 0], index=index_of_fc)
            upper_series = pd.Series(confint[:, 1], index=index_of_fc)
            print("[OK] Generated series")

            # Plot
            plt.plot(data)
            plt.plot(fitted_series, color='darkgreen')
            plt.fill_between(lower_series.index,
                             lower_series,
                             upper_series,
                             color='k', alpha=.15)
            plt.title("ARIMA - Final Forecast - Drug Sales")
            plt.show()
            print("[SUCESS] Generated forecast")
        except Exception as e:
            print("[FAILED] Caused by: {}".format(e))

if __name__ == "__main__":
    flow = operationsArima()
    flow.ForecastingWithArima()  # Init script
Summary:
                               SARIMAX Results
==============================================================================
Dep. Variable:                      y   No. Observations:                   22
Model:           SARIMAX(0, 1, 0, 11)   Log Likelihood                     nan
Date:                Mon, 13 Apr 2020   AIC                                nan
Time:                        21:19:10   BIC                                nan
Sample:                             0   HQIC                               nan
                                 - 22
Covariance Type:                  opg
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
intercept           0   5.33e-13          0      1.000   -1.05e-12    1.05e-12
sigma2          1e-10   5.81e-10      0.172      0.863   -1.04e-09    1.24e-09
==============================================================================
Ljung-Box (Q):                    nan   Jarque-Bera (JB):                  nan
Prob(Q):                          nan   Prob(JB):                          nan
Heteroskedasticity (H):           nan   Skew:                              nan
Prob(H) (two-sided):              nan   Kurtosis:                          nan
==============================================================================
Answer:
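I see a few problems here. Because you have two short series at minute frequency that are a month apart, it is normal to see the straight blue line you mention: on a date axis spanning a month, each 10-minute window collapses to a single vertical stroke. Also, the green line looks like the original data itself, which means the model's forecast is exactly the same as your original data.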
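Both observations can be reproduced with a small sketch. It is not the asker's data: the timestamps and readings are made up, and it assumes eleven samples per window to match the 22 observations and the period of 11 in the summary. It also assumes that `SARIMAX(0, 1, 0, 11)` with a zero intercept means a pure seasonal difference, in which case the point forecast just repeats the last season. The helper `seasonal_naive` is hypothetical and only imitates that behaviour; it does not call pmdarima.

```python
import numpy as np
import pandas as pd

# Hypothetical timestamps: eleven one-minute samples on 1 Jan and on 1 Feb 2020.
jan = pd.date_range("2020-01-01 09:00", periods=11, freq="min")
feb = pd.date_range("2020-02-01 09:00", periods=11, freq="min")

# 1) Why the blue line looks straight: the data occupy almost none of the axis.
window_span = jan[-1] - jan[0]      # time covered by one window
total_span = feb[-1] - jan[0]       # time covered by the whole plot
share_with_data = 2 * window_span / total_span

# 2) Why the green line copies the data: a pure seasonal difference with
#    period m and zero intercept forecasts y[T+h] = y[T+h-m].
def seasonal_naive(values, m, n_periods):
    """Repeat the last m observations until n_periods values are produced."""
    last_season = np.asarray(values)[-m:]
    repeats = -(-n_periods // m)    # ceiling division
    return np.tile(last_season, repeats)[:n_periods]

y = np.arange(22, dtype=float)      # made-up readings, 22 observations
forecast = seasonal_naive(y, m=11, n_periods=11)

print("window span:", window_span)
print("share of the axis holding data:", share_with_data)
print("forecast equals last window:", np.array_equal(forecast, y[-11:]))
```

Under these assumptions the two windows together cover well under a tenth of a percent of the date axis, and the forecast is a copy of the February window.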
Finally, I don't think splicing two independent time series together is a good idea…
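One possible way to avoid the splice, sketched here with made-up readings rather than the contents of `minute.csv`, is to store each session as its own column keyed by minute offset. The names `N`, `sessions` and `march_index` are illustrative assumptions. The sketch also compares the script's `freq='MS'` forecast index, which steps by month starts, with a minute-frequency index for the March window.

```python
import pandas as pd

N = 11  # samples per session, mirroring n_periods in the script (an assumption)

# Made-up readings standing in for the real data.
jan = pd.Series(range(N), dtype=float,
                index=pd.date_range("2020-01-01 09:00", periods=N, freq="min"))
feb = pd.Series(range(N, 2 * N), dtype=float,
                index=pd.date_range("2020-02-01 09:00", periods=N, freq="min"))

# Keep the sessions apart: one column per session, rows keyed by minute offset.
sessions = pd.DataFrame(
    {"2020-01": jan.to_numpy(), "2020-02": feb.to_numpy()},
    index=pd.RangeIndex(N, name="minute_offset"),
)

# freq="MS", as in the script, advances one month per forecast step.
monthly_index = pd.date_range(feb.index[-1], periods=N, freq="MS")

# A minute-frequency index covers the March window that was asked for.
march_index = pd.date_range("2020-03-01 09:00", periods=N, freq="min")

print(sessions.head())
print("monthly index ends at:", monthly_index[-1])
print("minute index ends at: ", march_index[-1])
```

With this layout each row compares the same minute across sessions, and the forecast timestamps stay inside 2020-03-01 instead of running months ahead.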