I recently developed a fully functional random forest regression program using scikit-learn's RandomForestRegressor, and I am now interested in comparing its performance with other libraries. I found XGBoost's scikit-learn API random forest regressor (XGBRFRegressor) and ran a small test using an all-zero X feature matrix and an all-zero y target vector.
from numpy import array
from xgboost import XGBRFRegressor
from sklearn.ensemble import RandomForestRegressor

tree_number = 100
depth = 10
jobs = 1
dimension = 19

sk_VAL = RandomForestRegressor(n_estimators=tree_number, max_depth=depth,
                               random_state=42, n_jobs=jobs)
xgb_VAL = XGBRFRegressor(n_estimators=tree_number, max_depth=depth,
                         random_state=42, n_jobs=jobs)

dataset = array([[0.0] * dimension, [0.0] * dimension])
y_val = array([0.0, 0.0])

sk_VAL.fit(dataset, y_val)
xgb_VAL.fit(dataset, y_val)

sk_predict = sk_VAL.predict(array([[0.0] * dimension]))
xgb_predict = xgb_VAL.predict(array([[0.0] * dimension]))
print("sk_prediction = {}\nxgb_prediction = {}".format(sk_predict, xgb_predict))
Surprisingly, the xgb_VAL model's prediction on an all-zero input sample is non-zero:
sk_prediction = [0.]
xgb_prediction = [0.02500369]
Is there an error in my evaluation, or in how I constructed the comparison, that explains this result?
Answer:
It appears that XGBoost includes a global bias in the model, which is fixed at 0.5 rather than computed from the input data. This issue has been raised in XGBoost's GitHub repository (see https://github.com/dmlc/xgboost/issues/799). The corresponding hyperparameter is base_score; if you set it to zero, the model predicts zero as expected.
from numpy import array
from xgboost import XGBRFRegressor
from sklearn.ensemble import RandomForestRegressor

tree_number = 100
depth = 10
jobs = 1
dimension = 19

sk_VAL = RandomForestRegressor(n_estimators=tree_number, max_depth=depth,
                               random_state=42, n_jobs=jobs)
# base_score=0 removes XGBoost's fixed global bias of 0.5
xgb_VAL = XGBRFRegressor(n_estimators=tree_number, max_depth=depth,
                         base_score=0, random_state=42, n_jobs=jobs)

dataset = array([[0.0] * dimension, [0.0] * dimension])
y_val = array([0.0, 0.0])

sk_VAL.fit(dataset, y_val)
xgb_VAL.fit(dataset, y_val)

sk_predict = sk_VAL.predict(array([[0.0] * dimension]))
xgb_predict = xgb_VAL.predict(array([[0.0] * dimension]))
print("sk_prediction = {}\nxgb_prediction = {}".format(sk_predict, xgb_predict))
# sk_prediction = [0.]
# xgb_prediction = [0.]