I recently discovered this wonderful library for machine learning interpretability. I decided to build a simple gradient boosting classifier on a toy dataset from sklearn and draw a force_plot.
To make sense of the plot, the library's documentation says the following:

The above explanation shows features each contributing to push the model output from the base value (the average model output over the training dataset we passed) to the model output. Features pushing the prediction higher are shown in red, those pushing the prediction lower are in blue (these force plots are introduced in our Nature BME paper).
So it seems to me that the base_value should be the same as clf.predict(X_train).mean(), which equals 0.637. However, that is not the case when looking at the plot; the number is in fact not even within [0, 1]. I tried log-transforming with different bases (10, e, 2), assuming this might be some monotonic transformation... but still no luck. How can I get to this base_value?
!pip install shap

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
import pandas as pd
import shap

X, y = load_breast_cancer(return_X_y=True)
X = pd.DataFrame(data=X)
y = pd.DataFrame(data=y)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = GradientBoostingClassifier(random_state=0)
clf.fit(X_train, y_train)
print(clf.predict(X_train).mean())

# load JS visualization code to notebook
shap.initjs()

explainer = shap.TreeExplainer(clf)
shap_values = explainer.shap_values(X_train)

# visualize the first prediction's explanation (use matplotlib=True to avoid Javascript)
shap.force_plot(explainer.expected_value, shap_values[0,:], X_train.iloc[0,:])
Answer:
To get to the base_value in raw space (when link="identity"), you need to unwind class labels into probabilities, and probabilities into raw scores. Note that the default loss is "deviance", so the raw score is the inverse sigmoid of the probability, i.e. the logit: raw = log(p / (1 - p)). In code:
import numpy as np

# probabilities
y = clf.predict_proba(X_train)[:, 1]
# raw scores, default link="identity"
y_raw = np.log(y / (1 - y))
# expected raw score
print(np.mean(y_raw))
print(np.isclose(explainer.expected_value, np.mean(y_raw), 1e-12))

which prints:

2.065861773054686
[ True]
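An equivalent cross-check, assuming a binary GradientBoostingClassifier like the one above: sklearn's decision_function already returns these raw log-odds scores, so its mean should reproduce explainer.expected_value without the manual log:

import numpy as np

# decision_function returns the raw (pre-sigmoid) log-odds for binary deviance loss,
# so averaging it gives the same expected raw score as the manual logit above
raw_scores = clf.decision_function(X_train)
print(np.isclose(explainer.expected_value, raw_scores.mean()))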
The relevant plot for the 0th data point, in raw space:
shap.force_plot(explainer.expected_value[0], shap_values[0,:], X_train.iloc[0,:], link="identity")
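As a sanity check on what this plot is adding up, here is a minimal sketch (reusing the objects defined above): SHAP's local-accuracy property says the base value plus a sample's SHAP values should reconstruct that sample's raw score:

import numpy as np

# base value + per-feature SHAP values should recover sample 0's raw score
reconstructed = explainer.expected_value[0] + shap_values[0, :].sum()
print(np.isclose(reconstructed, clf.decision_function(X_train)[0]))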
If you wish to switch to the sigmoid probability space (link="logit"):
from scipy.special import expit, logit

# probabilities
y = clf.predict_proba(X_train)[:, 1]
# expected raw base value
y_raw = logit(y).mean()
# expected probability, i.e. base value in probability space
print(expit(y_raw))

which prints:

0.8875405774316522
The relevant plot for the 0th data point, in probability space:
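The corresponding call (the same one used in the full example below):

shap.force_plot(explainer.expected_value[0], shap_values[0,:], X_train.iloc[0,:], link="logit")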
Note that the probability base_value, from shap's point of view, which they call the baseline probability if no data is available, is not what a reasonable person would define in the absence of independent variables (0.6373626373626373 in this case).
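A minimal sketch of that gap, reusing the fitted clf and the expit/logit imports from above: the sigmoid of the mean raw score is not the same as the mean of the predictions, which is why shap's probability-space base value (~0.888) differs from the intuitive baseline (~0.637):

# the "intuitive" baseline: mean prediction over the training data
print(clf.predict(X_train).mean())                            # 0.6373626373626373
# shap's probability-space base value: sigmoid of the mean raw score
print(expit(logit(clf.predict_proba(X_train)[:, 1]).mean()))  # 0.8875405774316522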
Full reproducible example:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
import pandas as pd
import shap

print(shap.__version__)

X, y = load_breast_cancer(return_X_y=True)
X = pd.DataFrame(data=X)
y = pd.DataFrame(data=y)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = GradientBoostingClassifier(random_state=0)
clf.fit(X_train, y_train.values.ravel())

# load JS visualization code to notebook
shap.initjs()

explainer = shap.TreeExplainer(clf, model_output="raw")
shap_values = explainer.shap_values(X_train)

from scipy.special import expit, logit

# probabilities
y = clf.predict_proba(X_train)[:, 1]
# expected raw base value
y_raw = logit(y).mean()
# expected probability, i.e. base value in probability space
print("Expected raw score (before sigmoid):", y_raw)
print("Expected probability:", expit(y_raw))

# visualize the first prediction's explanation (use matplotlib=True to avoid Javascript)
shap.force_plot(explainer.expected_value[0], shap_values[0,:], X_train.iloc[0,:], link="logit")
Output:
0.36.0
Expected raw score (before sigmoid): 2.065861773054686
Expected probability: 0.8875405774316522