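I'm trying to examine the predictions made by xgboost.
It looks like two inputs that follow the same path through the tree get two different predictions.
I'm running it on the following dataset: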
f1,f2,f3,f4,f5,f6,f7,f8,y
6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0,23.3,0.672,32,1
1,89,66,23,94,28.1,0.167,21,0
0,137,40,35,168,43.1,2.288,33,1
5,116,74,0,0,25.6,0.201,30,0
3,78,50,32,88,31.0,0.248,26,1
10,115,0,0,0,35.3,0.134,29,0
2,197,70,45,543,30.5,0.158,53,1
8,125,96,0,0,0.0,0.232,54,1
4,110,92,0,0,37.6,0.191,30,0
10,168,74,0,0,38.0,0.537,34,1
10,139,80,0,0,27.1,1.441,57,0
1,189,60,23,846,30.1,0.398,59,1
5,166,72,19,175,25.8,0.587,51,1
7,100,0,0,0,30.0,0.484,32,1
0,118,84,47,230,45.8,0.551,31,1
7,107,74,0,0,29.6,0.254,31,1
1,103,30,38,83,43.3,0.183,33,0
1,115,70,30,96,34.6,0.529,32,1
3,126,88,41,235,39.3,0.704,27,0
8,99,84,0,0,35.4,0.388,50,0
7,196,90,0,0,39.8,0.451,41,1
9,119,80,35,0,29.0,0.263,29,1
11,143,94,33,146,36.6,0.254,51,1
10,125,70,26,115,31.1,0.205,41,1
7,147,76,0,0,39.4,0.257,43,1
1,97,66,15,140,23.2,0.487,22,0
13,145,82,19,110,22.2,0.245,57,0
5,117,92,0,0,34.1,0.337,38,0
5,109,75,26,0,36.0,0.546,60,0
3,158,76,36,245,31.6,0.851,28,1
3,88,58,11,54,24.8,0.267,22,0
6,92,92,0,0,19.9,0.188,28,0
10,122,78,31,0,27.6,0.512,45,0
4,103,60,33,192,24.0,0.966,33,0
11,138,76,0,0,33.2,0.420,35,0
9,102,76,37,0,32.9,0.665,46,1
2,90,68,42,0,38.2,0.503,27,1
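Prediction and tree-creation code:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier, plot_tree

# Load the dataset and separate the features from the label
df = pd.read_csv("input.csv")
x = df[['f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7', 'f8']]
y = df[['y']]
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=42)

# Fit the classifier and predict on the held-out rows
model = XGBClassifier(n_jobs=-1)
model.fit(X_train, y_train)
res = model.predict(X_test)

print("X_test (first 2 rows:")
print(X_test.head(2))
print("Predictions (first 2 rows:")
print(res[0:2])

# Plot a tree of the fitted model
plot_tree(model)
plt.show()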
Output:
X_test (first 2 rows:
    f1   f2  f3  f4  f5    f6     f7  f8
33   6   92  92   0   0  19.9  0.188  28
36  11  138  76   0   0  33.2  0.420  35
Predictions (first 2 rows:
[0 1]
Both inputs have f2<146.5 and f4=0, so in the plotted tree they end up in the same leaf (-0.34). So why are the two predictions different (0 and 1)?
Answer:
The plot you drew is not the whole XGBoost model; it is only the first tree of the ensemble.
To see why, look at the source code of plot_tree:
def plot_tree(booster, fmap='', num_trees=0, rankdir=None, ax=None, **kwargs):
    """Plot specified tree.
and at the documentation:
num_trees (int, default 0) – Specify the ordinal number of target tree
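From this it follows that if you do not specify the num_trees argument, as is the case here, it takes its default value of 0, i.e. the first tree of the ensemble.
With a different value of num_trees you get a different tree, and therefore a different decision path for each sample. The final prediction combines the leaf values from all the trees, so two samples that share a leaf in tree 0 can still end up with different predictions.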
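To make this concrete, here is a minimal pure-Python sketch of how a boosted ensemble for binary classification combines its trees. Everything in it is hypothetical except the -0.34 leaf and the two feature rows taken from the question: the second tree, its split on f6, and the other leaf values are invented for illustration, and a real XGBClassifier would typically contain many more trees plus a base score.

```python
import math

# Hypothetical two-tree ensemble. Only the -0.34 leaf mirrors the plotted tree
# from the question; every other threshold and leaf value is made up.
def tree_0(row):
    # Same shape as the plotted tree: f2 < 146.5 and f4 == 0 reach the -0.34 leaf.
    if row["f2"] < 146.5:
        return -0.34 if row["f4"] < 0.5 else 0.10
    return 0.45

def tree_1(row):
    # Invented second tree that splits on a different feature.
    return -0.30 if row["f6"] < 26.0 else 0.50

def predict(row, trees):
    # The raw score (margin) is the sum of one leaf value per tree;
    # a sigmoid turns it into a probability, thresholded at 0.5.
    margin = sum(tree(row) for tree in trees)
    prob = 1.0 / (1.0 + math.exp(-margin))
    return int(prob > 0.5)

# The two test rows from the question, reduced to the features used here.
row_33 = {"f2": 92, "f4": 0, "f6": 19.9}
row_36 = {"f2": 138, "f4": 0, "f6": 33.2}

for name, row in (("33", row_33), ("36", row_36)):
    print(name, tree_0(row), tree_1(row), predict(row, [tree_0, tree_1]))
```

Both rows receive the same value from tree_0, but the second tree separates them, so the summed scores fall on opposite sides of zero and the predicted classes differ.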
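You cannot plot all the trees of the boosted ensemble in a single figure (and even if you could, it would not be of much practical use). plot_tree is just a utility function for looking at the individual trees of a model. For usage examples, see How to Visualize Gradient Boosting Decision Trees With XGBoost in Python.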