I obtained some annual data on ozone, NO, NO2, and CO for research. The task is to use these data to predict ozone values. Suppose I have data for 2015, 2016, 2018, and 2019; I need to use the 2015, 2016, and 2018 data to predict the ozone values for 2019.
The data is recorded hourly and is presented month by month in the attached picture, so that is the format the data comes in.
What I did: first, I combined the data from all the years into a single Excel file with 4 columns: NO, NO2, CO, and O3, appending all the data month by month. This is the master file I am using (see the attached image).
I am using Python. The data first had to be cleaned; let me explain. NO, NO2, and CO are precursors of ozone, meaning ozone formation depends on these gases, so the data needs to be cleaned beforehand. The constraints are: remove any negative value, and drop the entire row (including the other columns), so if any one of the O3, NO, NO2, or CO values is invalid, the whole row must be removed rather than counted. The data also contains some entries in string format, which have to be removed as well. All of this has been done. Then I applied the MLP regressor from sklearn. Here is the code I used.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import explained_variance_score
from sklearn.metrics import mean_absolute_error
from sklearn.neural_network import MLPRegressor
import pandas as pd
import matplotlib.pyplot as plt

# string tokens that appear in the raw export and must be treated as invalid
bugs = ['NOx', '* 43.3', '* 312', '11/19', '11/28', '06:00', '09/30', '09/04',
        '14:00', '06/25', '07:00', '06/02', '17:00', '04/10', '04/17', '18:00',
        '02/26', '02/03', '01:00', '11/23', '15:00', '11/12', '24:00', '09/02',
        '16:00', '09/28', '* 16.8', '* 121', '12:00', '06/24', '13:00', '06/26',
        'Span', 'NoData', 'ppb', 'Zero', 'Samp<', 'RS232']

dataset = pd.read_excel("Testing.xlsx")
dataset = pd.DataFrame(dataset).replace(bugs, 0)

# drop rows with missing values in any pollutant column
dataset.dropna(subset=["O3"], inplace=True)
dataset.dropna(subset=["NO"], inplace=True)
dataset.dropna(subset=["NO2"], inplace=True)
dataset.dropna(subset=["CO"], inplace=True)

# drop rows whose values fall outside the valid range for each gas
dataset.drop(dataset[dataset['O3'] < 1].index, inplace=True)
dataset.drop(dataset[dataset['O3'] > 160].index, inplace=True)
dataset.drop(dataset[dataset['O3'] == 0].index, inplace=True)
dataset.drop(dataset[dataset['NO'] < 1].index, inplace=True)
dataset.drop(dataset[dataset['NO'] > 160].index, inplace=True)
dataset.drop(dataset[dataset['NO'] == 0].index, inplace=True)
dataset.drop(dataset[dataset['NO2'] < 1].index, inplace=True)
dataset.drop(dataset[dataset['NO2'] > 160].index, inplace=True)
dataset.drop(dataset[dataset['NO2'] == 0].index, inplace=True)
dataset.drop(dataset[dataset['CO'] < 1].index, inplace=True)
dataset.drop(dataset[dataset['CO'] > 4000].index, inplace=True)
dataset.drop(dataset[dataset['CO'] == 0].index, inplace=True)

dataset = dataset.reset_index()
dataset = dataset.drop(['index'], axis=1)

# features (precursors) and target (ozone)
X = dataset[["NO", "NO2", "CO"]].astype(int)
Y = dataset[["O3"]].astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.05, random_state=27)

sc_x = StandardScaler()
X_train = sc_x.fit_transform(X_train)
X_test = sc_x.fit_transform(X_test)

clf = MLPRegressor(hidden_layer_sizes=(100, 100, 100), max_iter=10000, verbose=True, random_state=8)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

print(explained_variance_score(y_test, y_pred))
print(mean_absolute_error(y_test, y_pred))

y_test = pd.DataFrame(y_test)
y_test = y_test.reset_index(0)
y_test = y_test.drop(['index'], axis=1)
# y_test = y_test.drop([19,20],axis=0)

y_pred = pd.DataFrame(y_pred)
y_pred = y_pred.shift(-1)
# y_pred = y_pred.drop([19,20],axis=0)

plt.figure(figsize=(10, 5))
plt.plot(y_pred, color='r', label='PredictedO3')
plt.plot(y_test, color='g', label='OriginalO3')
plt.legend()
plt.show()
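(As an aside, the same cleaning rules can be written more compactly. The sketch below is only an illustration, not the code actually used: it assumes the same column names, keeps the same valid ranges as above, and coerces every string entry to NaN instead of listing the tokens one by one.)

import pandas as pd

df = pd.read_excel("Testing.xlsx")[["NO", "NO2", "CO", "O3"]]

# coerce any string entry ('NoData', 'Span', 'ppb', ...) to NaN in one step
df = df.apply(pd.to_numeric, errors="coerce")

# keep only rows where every pollutant lies in its valid range;
# a row with any invalid value is dropped as a whole
limits = {"NO": (1, 160), "NO2": (1, 160), "CO": (1, 4000), "O3": (1, 160)}
mask = pd.Series(True, index=df.index)
for col, (lo, hi) in limits.items():
    mask &= df[col].between(lo, hi)   # NaN fails the check and is dropped too

df = df[mask].reset_index(drop=True)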
Console:
y = column_or_1d(y, warn=True)
Iteration 1, loss = 537.59597297
Iteration 2, loss = 185.33662023
Iteration 3, loss = 159.32122111
Iteration 4, loss = 156.71612690
Iteration 5, loss = 155.05307865
Iteration 6, loss = 154.59351630
Iteration 7, loss = 154.16687592
Iteration 8, loss = 153.69258698
Iteration 9, loss = 153.36140320
Iteration 10, loss = 152.94593665
... (iterations 11-168 omitted; the loss decreases slowly from about 152 to about 130) ...
Iteration 169, loss = 130.28019987
Iteration 170, loss = 129.95417212
Iteration 171, loss = 131.06510048
Iteration 172, loss = 131.21377407
Iteration 173, loss = 130.17368709
Training loss did not improve more than tol=0.000100 for 10 consecutive epochs. Stopping.
0.2442499851919634
12.796789671568312
This is the final plot. If I have done anything wrong here, please correct me. Best regards.
Answer:
Questions like this are really hard to answer precisely, because the answer depends heavily on the dataset used, which we do not have.
That said, since your target variable seems to have a rather high dynamic range, you should try scaling it as well with a separate scaler; take care to inverse-transform the predictions back to the original scale before computing errors or plotting:
sc_y = StandardScaler()
y_train = sc_y.fit_transform(y_train.values.reshape(-1, 1))
y_test = sc_y.transform(y_test.values.reshape(-1, 1))

# model definition and fitting
# ...

y_pred_scaled = clf.predict(X_test)                            # predictions on the scaled target
y_pred = sc_y.inverse_transform(y_pred_scaled.reshape(-1, 1))  # transform back to the original scale
From this point on, you should be able to proceed with y_pred just as you do in your code.
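If you prefer not to manage the target scaler by hand, the same idea can also be expressed with sklearn's TransformedTargetRegressor, which fits the scaler on the training targets and inverse-transforms the predictions for you. A minimal sketch, reusing the X_train/X_test/y_train/y_test variables from your code:

from sklearn.compose import TransformedTargetRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_absolute_error

model = TransformedTargetRegressor(
    regressor=MLPRegressor(hidden_layer_sizes=(100, 100, 100),
                           max_iter=10000, random_state=8),
    transformer=StandardScaler())           # scales y internally

model.fit(X_train, y_train.values.ravel())  # targets are scaled before fitting
y_pred = model.predict(X_test)              # predictions come back on the original scale
print(mean_absolute_error(y_test, y_pred))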
Also, irrelevant to your issue, but the way you scale your features is wrong; we never use fit_transform on the test data. The correct way is:
sc_x = StandardScaler()
X_train = sc_x.fit_transform(X_train)
X_test = sc_x.transform(X_test)   # transform, not fit_transform, here
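A common way to avoid this mistake entirely is to wrap the scaler and the regressor in a Pipeline, so the scaler is fitted on the training data only and automatically applied to the test data. A minimal sketch, assuming X_train and X_test are the unscaled feature frames:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPRegressor

pipe = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(100, 100, 100), max_iter=10000, random_state=8))

pipe.fit(X_train, y_train.values.ravel())  # fit_transform happens on X_train only
y_pred = pipe.predict(X_test)              # X_test is transformed with the training statistics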
As said, this is only a suggestion; the keyword here is experiment (try different numbers of layers, different numbers of units per layer, different scalers, and so on).
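If you want something more systematic than trying settings by hand, such experiments can be organized with a small grid search; the values below are only placeholders, not recommendations:

from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.neural_network import MLPRegressor

pipe = Pipeline([("scale", StandardScaler()),
                 ("mlp", MLPRegressor(max_iter=10000, random_state=8))])

param_grid = {
    "scale": [StandardScaler(), MinMaxScaler()],                      # different scalers
    "mlp__hidden_layer_sizes": [(50,), (100, 100), (100, 100, 100)],  # layers / units per layer
}

search = GridSearchCV(pipe, param_grid, scoring="neg_mean_absolute_error", cv=5)
search.fit(X_train, y_train.values.ravel())
print(search.best_params_, search.best_score_)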