I am training a model to predict future rainfall data, and I have finished training it. The dataset I am using is: https://www.kaggle.com/redikod/historical-rainfall-data-in-bangladesh
The dataset looks like this:
            StationIndex  Station  Year  Month  Day  Rainfall  dayofyear
1970-01-01             1    Dhaka  1970      1    1         0          1
1970-01-02             1    Dhaka  1970      1    2         0          2
1970-01-03             1    Dhaka  1970      1    3         0          3
1970-01-04             1    Dhaka  1970      1    4         0          4
1970-01-05             1    Dhaka  1970      1    5         0          5
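Using reference code I found online, I trained the model on a train/test split and also compared the predicted values against the actual values.
Here is the code: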
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf

# data is in local folder
df = pd.read_csv("data.csv")
df.head(5)

# Drop rows with impossible calendar dates (e.g. Feb 30, Apr 31)
df.drop(df[(df['Day']>28) & (df['Month']==2) & (df['Year']%4!=0)].index, inplace=True)
df.drop(df[(df['Day']>29) & (df['Month']==2) & (df['Year']%4==0)].index, inplace=True)
df.drop(df[(df['Day']>30) & ((df['Month']==4)|(df['Month']==6)|(df['Month']==9)|(df['Month']==11))].index, inplace=True)

# Use the calendar date as the index and derive the day of the year
date = [str(y)+'-'+str(m)+'-'+str(d) for y, m, d in zip(df.Year, df.Month, df.Day)]
df.index = pd.to_datetime(date)
df['date'] = df.index
df['dayofyear'] = df['date'].dt.dayofyear
df.drop('date', axis=1, inplace=True)
df.head()
df.size  # size is an attribute; calling df.size() raises a TypeError
df.info()
df.plot(x='Year', y='Rainfall', style='.', figsize=(15,5))

# Train on 1970-2015, test on 2016, Dhaka station only
train = df.loc[df['Year'] <= 2015]
test = df.loc[df['Year'] == 2016]
train = train[train['Station']=='Dhaka']
test = test[test['Station']=='Dhaka']

X_train = train.drop(['Station','StationIndex','dayofyear'], axis=1)
Y_train = train['Rainfall']
X_test = test.drop(['Station','StationIndex','dayofyear'], axis=1)
Y_test = test['Rainfall']

from sklearn import svm
from sklearn.svm import SVC
model = svm.SVC(gamma='auto', kernel='linear')
model.fit(X_train, Y_train)
Y_pred = model.predict(X_test)

# Compare actual and predicted rainfall on the test year
df1 = pd.DataFrame({'Actual Rainfall': Y_test, 'Predicted Rainfall': Y_pred})
df1[df1['Predicted Rainfall']!=0].head(10)
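After this, I tried to actually use the model to predict rainfall for the coming days, months, or years. I tried several approaches, such as ones used for predicting stock prices (after adapting the code), but none of them seemed to work. Since I had already trained the model, I assumed predicting future days would be easy. For example, I train on data from 1970 to 2015 and test on data from 2016. Now I want to predict the rainfall for 2017, or something along those lines.
My question is: how can I do this in an intuitive way?
I would be very grateful if someone could answer this.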
Edit @Mercury: This is the actual result after using that code. I suspect the model is not running at all... Here is a picture of the actual result: https://i.sstatic.net/81Vk1.png
Answer:
I noticed a very simple mistake here:
X_train = train.drop(['Station','StationIndex','dayofyear'], axis=1)
Y_train = train['Rainfall']
X_test = test.drop(['Station','StationIndex','dayofyear'], axis=1)
Y_test = test['Rainfall']
You did not drop the Rainfall column from the training data.
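I'll go out on a limb and guess that you got a perfect 100% accuracy on both the training and the test set, right? This is why. The model learned that whatever is in the 'Rainfall' column of the input is the answer, so it did the same thing at test time and got perfect results, but it was not actually predicting anything!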
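One way to make this kind of leak hard to miss is to build the features and the target in a single step and check that the target is not among the features. The following is only a minimal sketch: `split_features_target` is a hypothetical helper (not part of pandas or scikit-learn), and the toy rows are copied from the first lines of the table in the question.

```python
import pandas as pd

def split_features_target(frame, target, drop_cols=()):
    """Return (X, y) with the target column guaranteed absent from X."""
    # Drop the helper columns and the target itself from the features.
    X = frame.drop(columns=list(drop_cols) + [target])
    y = frame[target]
    # Guard against target leakage: the label must not be an input feature.
    assert target not in X.columns, f"{target!r} leaked into the features"
    return X, y

# Toy rows shaped like the first lines of the question's table.
toy = pd.DataFrame({
    "StationIndex": [1, 1, 1],
    "Station": ["Dhaka", "Dhaka", "Dhaka"],
    "Year": [1970, 1970, 1970],
    "Month": [1, 1, 1],
    "Day": [1, 2, 3],
    "Rainfall": [0, 0, 0],
    "dayofyear": [1, 2, 3],
})

X, y = split_features_target(toy, "Rainfall", ["Station", "StationIndex", "dayofyear"])
print(list(X.columns))  # ['Year', 'Month', 'Day']
```

With the same call applied to `train` and `test`, the model only ever sees Year, Month, and Day as inputs.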
Try running it like this:
X_train = train.drop(['Station','StationIndex','dayofyear','Rainfall'], axis=1)
Y_train = train['Rainfall']
X_test = test.drop(['Station','StationIndex','dayofyear','Rainfall'], axis=1)
Y_test = test['Rainfall']

from sklearn import svm
model = svm.SVC(gamma='auto', kernel='linear')
model.fit(X_train, Y_train)
print('Accuracy on training set: {:.2f}%'.format(100*model.score(X_train, Y_train)))
print('Accuracy on testing set: {:.2f}%'.format(100*model.score(X_test, Y_test)))
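Once Rainfall is no longer an input, the remaining features are just Year, Month, and Day, so predicting 2017 comes down to building those three columns for every day of that year and passing them to `model.predict`. A rough sketch of that step follows; `make_future_features` is a hypothetical helper name, and it assumes the feature columns are exactly Year, Month, Day in that order, as they would be after the drops above.

```python
import pandas as pd

def make_future_features(start, end):
    """Build one row per calendar day with Year, Month, Day columns."""
    days = pd.date_range(start=start, end=end, freq="D")
    # Same column names and order as X_train after the drops above (assumed).
    return pd.DataFrame(
        {
            "Year": days.year.to_numpy(),
            "Month": days.month.to_numpy(),
            "Day": days.day.to_numpy(),
        },
        index=days,
    )

future_X = make_future_features("2017-01-01", "2017-12-31")
print(future_X.shape)  # (365, 3)

# With the fitted model from the snippet above, the forecast would be:
# future_pred = pd.Series(model.predict(future_X), index=future_X.index)
```

Whether calendar features alone give a useful forecast is something the training and test accuracies printed above should help you judge.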