I recently developed a fully functional random forest regression program using scikit-learn's RandomForestRegressor, and I am now interested in comparing its performance with other libraries. I found XGBoost's scikit-learn API random forest regressor (XGBRFRegressor) and ran a small test using an all-zero X feature matrix and an all-zero y target vector.
from numpy import array
from xgboost import XGBRFRegressor
from sklearn.ensemble import RandomForestRegressor

tree_number = 100
depth = 10
jobs = 1
dimension = 19

sk_VAL = RandomForestRegressor(n_estimators=tree_number, max_depth=depth,
                               random_state=42, n_jobs=jobs)
xgb_VAL = XGBRFRegressor(n_estimators=tree_number, max_depth=depth,
                         random_state=42, n_jobs=jobs)

dataset = array([[0.0] * dimension, [0.0] * dimension])
y_val = array([0.0, 0.0])

sk_VAL.fit(dataset, y_val)
xgb_VAL.fit(dataset, y_val)

sk_predict = sk_VAL.predict(array([[0.0] * dimension]))
xgb_predict = xgb_VAL.predict(array([[0.0] * dimension]))
print("sk_prediction = {}\nxgb_prediction = {}".format(sk_predict, xgb_predict))
Surprisingly, the xgb_VAL model's prediction on an all-zero input sample is non-zero:
sk_prediction = [0.]
xgb_prediction = [0.02500369]
Is there an error in my evaluation, or in how I constructed the comparison, that explains this result?
Answer:
It appears that XGBoost includes a global bias in the model, which is fixed at 0.5 rather than computed from the input data. This issue has been raised in XGBoost's GitHub repository (see https://github.com/dmlc/xgboost/issues/799). The corresponding hyperparameter is base_score; if you set it to zero, the model predicts zero as expected.
from numpy import array
from xgboost import XGBRFRegressor
from sklearn.ensemble import RandomForestRegressor

tree_number = 100
depth = 10
jobs = 1
dimension = 19

sk_VAL = RandomForestRegressor(n_estimators=tree_number, max_depth=depth,
                               random_state=42, n_jobs=jobs)
# base_score=0 removes XGBoost's fixed global bias of 0.5
xgb_VAL = XGBRFRegressor(n_estimators=tree_number, max_depth=depth,
                         base_score=0, random_state=42, n_jobs=jobs)

dataset = array([[0.0] * dimension, [0.0] * dimension])
y_val = array([0.0, 0.0])

sk_VAL.fit(dataset, y_val)
xgb_VAL.fit(dataset, y_val)

sk_predict = sk_VAL.predict(array([[0.0] * dimension]))
xgb_predict = xgb_VAL.predict(array([[0.0] * dimension]))
print("sk_prediction = {}\nxgb_prediction = {}".format(sk_predict, xgb_predict))
# sk_prediction = [0.]
# xgb_prediction = [0.]