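I'm playing around with sklearn and want to use the Open, High, and Low prices plus Volume to predict the TSLA Close price for a few days. I used a very basic model to predict the closing prices, and the results show the predictions are perfectly accurate. I'm not sure why. An error of 0% makes me think I may not have set up my model correctly.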
Code:
    from os import X_OK
    from numpy.lib.shape_base import apply_along_axis
    import pandas as pd
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.metrics import mean_absolute_error

    tsla_data_path = "/Users/simon/Documents/PythonVS/ML/TSLA.csv"
    tsla_data = pd.read_csv(tsla_data_path)
    tsla_features = ['Open','High','Low','Volume']

    y = tsla_data.Close
    X = tsla_data[tsla_features]

    # define model
    tesla_model = DecisionTreeRegressor(random_state = 1)

    # fit model
    tesla_model.fit(X,y)

    # print results
    print('making predictions for the following five dates')
    print(X.head())
    print('________________________________________________')
    print('the predictions are')
    print(tesla_model.predict(X.head()))
    print('________________________________________________')
    print('the error is ')
    print(mean_absolute_error(y.head(),tesla_model.predict(X.head())))
Output:
    making predictions for the following five dates
            Open       High        Low    Volume
    0  67.054001  67.099998  65.419998  39737000
    1  66.223999  66.786003  65.713997  27778000
    2  66.222000  66.251999  65.500000  12328000
    3  65.879997  67.276001  65.737999  30372500
    4  66.524002  67.582001  66.438004  32868500
    ________________________________________________
    the predictions are
    [65.783997 66.258003 65.987999 66.973999 67.239998]
    ________________________________________________
    the error is
    0.0
Data:
    Date,Open,High,Low,Close,Adj_Close,Volume
    2019-11-26,67.054001,67.099998,65.419998,65.783997,65.783997,39737000
    2019-11-27,66.223999,66.786003,65.713997,66.258003,66.258003,27778000
    2019-11-29,66.222000,66.251999,65.500000,65.987999,65.987999,12328000
    2019-12-02,65.879997,67.276001,65.737999,66.973999,66.973999,30372500
    2019-12-03,66.524002,67.582001,66.438004,67.239998,67.239998,32868500
Answer:
You made one mistake: you are measuring the model's performance on the same dataset you used to train it.
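To illustrate why scoring on the training rows gives an error of zero, here is a minimal sketch. It does not use the TSLA file; the random features and the pure-noise target are assumptions made up for the demonstration, as are the names `X_demo`, `y_demo` and `demo_model`. A `DecisionTreeRegressor` with no depth limit keeps splitting until it can reproduce each training target, so the training error says nothing about how well the model predicts.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption, not the TSLA file):
# 200 rows, 4 feature columns, and a target that is pure noise,
# so there is nothing real for the model to learn.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = rng.normal(size=200)

# Same kind of model as in the question: no depth limit.
demo_model = DecisionTreeRegressor(random_state=1)
demo_model.fit(X_demo, y_demo)

# Score on the very rows the tree was fitted on, as the question does.
train_mae = mean_absolute_error(y_demo, demo_model.predict(X_demo))

print("leaves:", demo_model.get_n_leaves(), "rows:", len(X_demo))
print("MAE on the training rows:", train_mae)
```

Even though the target here is unrelated to the features, the error on the training rows should come out as zero (or numerically indistinguishable from it), because the tree has in effect memorized the rows it was given.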
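If you want to evaluate your model's performance properly, you should split your dataset into two parts: one to train the model and one to measure its performance. You can use sklearn.model_selection.train_test_split() to split the dataset, like this: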
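    from sklearn.model_selection import train_test_split

    tesla_model = DecisionTreeRegressor(random_state = 1)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    # fit on the training features and the training targets
    tesla_model.fit(X_train, y_train)
    # measure the error on the held-out rows only
    mae = mean_absolute_error(y_test, tesla_model.predict(X_test))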
Take a look at this Wikipedia page to learn about the roles of the different datasets in machine learning.