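I have a DataFrame with a time-series index that contains several variables and humidity readings. I have trained a machine learning model to predict the humidity value from X, Y, and Z. Now, after loading the saved model with pickle, I want to use X, Y, and Z to fill in the missing humidity values. However, a row should only be filled when X, Y, and Z themselves are not missing.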
Time             X    Y    Z    Humidity
1/2/2017 13:00   31   22   21   48
1/2/2017 14:00   NaN  12   NaN  NaN
1/2/2017 15:00   25   55   33   NaN
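In this example, the humidity value in the last row should be filled in by the model. The second row should not be predicted, because X and Z are also missing.
This is what I have tried so far: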
with open('model_pickle','rb') as f:
    mp = pickle.load(f)

for i, value in enumerate(df['Humidity'].values):
    if np.isnan(value):
        df['Humidity'][i] = mp.predict(df['X'][i],df['Y'][i],df['Z'][i])
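This raises the error 'predict() takes from 2 to 5 positional arguments but 6 were given', and it also does not check whether the values in the X, Y, and Z columns are present. Here is the code I used to train the model and save it to a file: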
df = df.dropna()
dfTest = df.loc['2017-01-01':'2019-02-28']
dfTrain = df.loc['2019-03-01':'2019-03-18']

features = ['X', 'Y', 'Z']
train_X = dfTrain[features]
train_y = dfTrain.Humidity
test_X = dfTest[features]
test_y = dfTest.Humidity

model = xgb.XGBRegressor(max_depth=10,learning_rate=0.07)
model.fit(train_X,train_y)
predXGB = model.predict(test_X)
mae = mean_absolute_error(predXGB,test_y)

import pickle
with open('model_pickle','wb') as f:
    pickle.dump(model,f)
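I did not get any errors when training and saving the model.
Answer:
For prediction, since you want to make sure that all of the X, Y, and Z values are present, you can do this: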
df = df.dropna(subset = ["X", "Y", "Z"])
Now you can predict on the remaining valid examples:
# where features = ["X", "Y", "Z"]
df['Humidity'] = mp.predict(df[features])
mp.predict returns predictions for all rows at once, so there is no need to predict row by row.
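To illustrate why the original call failed and why one batch call is enough, here is a minimal sketch. It assumes the model follows the scikit-learn style interface, where predict takes a single 2-D input (rows by features) rather than one argument per feature. WeightedSumModel is a hypothetical stand-in for the unpickled model, used only so the example runs without XGBoost.

```python
import numpy as np
import pandas as pd


class WeightedSumModel:
    """Hypothetical stand-in for the unpickled model (not XGBoost)."""

    def predict(self, X):
        # Expects ONE 2-D input: one row per sample, one column per feature.
        return np.asarray(X, dtype=float) @ np.array([1.0, 0.5, 0.25])


features = ["X", "Y", "Z"]
df = pd.DataFrame({"X": [31.0, 25.0], "Y": [22.0, 55.0], "Z": [21.0, 33.0]})
mp = WeightedSumModel()

# One call predicts every row.
batch = mp.predict(df[features])

# Row by row also works if each row is kept 2-D via the list selector [i],
# but it is slower and gives the same numbers.
one_by_one = np.array([mp.predict(df.loc[[i], features])[0] for i in df.index])
```

Passing df['X'][i], df['Y'][i], df['Z'][i] as three separate arguments is what triggers the positional-argument error, because the extra values are interpreted as other parameters of predict.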
Edit:
For inference, assuming you have a DataFrame df, you can do this:
import pandas as pd

# Rows with missing Humidity are the candidates for prediction.
df_inference = df[df.Humidity.isnull()]
# The remaining rows already have a Humidity value.
df = df[df.Humidity.notnull()]
# Candidates may still have missing features. The model cannot predict
# for those, so move them back to the remaining rows
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0).
missing_features = df_inference[features].isnull().any(axis=1)
df = pd.concat([df, df_inference[missing_features]])
# Keep only candidates with all features; copy to avoid SettingWithCopyWarning.
df_inference = df_inference[~missing_features].copy()
# Now predict Humidity for these rows.
df_inference['Humidity'] = mp.predict(df_inference[features])
# Merge back to restore the original number of rows, then sort by index.
# sort_index returns a new DataFrame, so assign the result.
df = pd.concat([df, df_inference]).sort_index()
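The same result can probably be reached without splitting and re-merging the DataFrame, by building a boolean mask and assigning through .loc. The sketch below is one possible variant, not the only way to do it. SumModel is a hypothetical stand-in for the unpickled model, and the sample dates are read as 2 January 2017, which is an assumption about the format in the question.

```python
import numpy as np
import pandas as pd


class SumModel:
    """Hypothetical stand-in for the unpickled model: predicts X + Y + Z."""

    def predict(self, X):
        return np.asarray(X, dtype=float).sum(axis=1)


features = ["X", "Y", "Z"]
df = pd.DataFrame(
    {
        "X": [31, np.nan, 25],
        "Y": [22, 12, 55],
        "Z": [21, np.nan, 33],
        "Humidity": [48, np.nan, np.nan],
    },
    # Assumed date interpretation of the sample rows in the question.
    index=pd.to_datetime(["2017-01-02 13:00", "2017-01-02 14:00", "2017-01-02 15:00"]),
)
mp = SumModel()

# True only where Humidity is missing AND every feature is present.
mask = df["Humidity"].isna() & df[features].notna().all(axis=1)

# Guard against calling predict on an empty selection.
if mask.any():
    df.loc[mask, "Humidity"] = mp.predict(df.loc[mask, features])
```

With this approach the observed humidity values are left untouched, rows with missing features keep NaN, and the row order and index are preserved, so no final sort is needed.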