I recently discovered this wonderful library for machine learning interpretability. I decided to build a simple gradient boosting classifier on a toy dataset from sklearn and draw a force_plot.
To make sense of the plot, the library's documentation says the following:

The above explanation shows features each contributing to push the model output from the base value (the average model output over the training dataset we passed) to the model output. Features pushing the prediction higher are shown in red, those pushing the prediction lower are in blue (these force plots are introduced in our Nature BME paper).
So it seems to me that the base_value should be the same as clf.predict(X_train).mean(), which equals 0.637. However, that is not the case when looking at the plot; the number is in fact not even within [0, 1]. I tried log-transforming with different bases (10, e, 2), assuming this might be some monotonic transformation... but still no luck. How can I get to this base_value?
!pip install shap

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
import pandas as pd
import shap

X, y = load_breast_cancer(return_X_y=True)
X = pd.DataFrame(data=X)
y = pd.DataFrame(data=y)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = GradientBoostingClassifier(random_state=0)
clf.fit(X_train, y_train)
print(clf.predict(X_train).mean())

# load JS visualization code to notebook
shap.initjs()

explainer = shap.TreeExplainer(clf)
shap_values = explainer.shap_values(X_train)

# visualize the first prediction's explanation (use matplotlib=True to avoid Javascript)
shap.force_plot(explainer.expected_value, shap_values[0,:], X_train.iloc[0,:])
Answer:
To get to the base_value in raw space (when link="identity"), you need to unwind class labels into probabilities, and probabilities into raw scores. Note that the default loss is "deviance", so the raw score is the inverse sigmoid of the probability, i.e. the logit: raw = log(p / (1 - p)). In code:
import numpy as np

# probabilities
y = clf.predict_proba(X_train)[:, 1]
# raw scores, default link="identity"
y_raw = np.log(y / (1 - y))
# expected raw score
print(np.mean(y_raw))
print(np.isclose(explainer.expected_value, np.mean(y_raw), 1e-12))

which prints:

2.065861773054686
[ True]
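An equivalent cross-check, assuming a binary GradientBoostingClassifier like the one above: sklearn's decision_function already returns these raw log-odds scores, so its mean should reproduce explainer.expected_value without the manual log:

import numpy as np

# decision_function returns the raw (pre-sigmoid) log-odds for binary deviance loss,
# so averaging it gives the same expected raw score as the manual logit above
raw_scores = clf.decision_function(X_train)
print(np.isclose(explainer.expected_value, raw_scores.mean()))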
The relevant plot for the 0th data point, in raw space:
shap.force_plot(explainer.expected_value[0], shap_values[0,:], X_train.iloc[0,:], link="identity")
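As a sanity check on what this plot is adding up, here is a minimal sketch (reusing the objects defined above): SHAP's local-accuracy property says the base value plus a sample's SHAP values should reconstruct that sample's raw score:

import numpy as np

# base value + per-feature SHAP values should recover sample 0's raw score
reconstructed = explainer.expected_value[0] + shap_values[0, :].sum()
print(np.isclose(reconstructed, clf.decision_function(X_train)[0]))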
If you wish to switch to the sigmoid probability space (link="logit"):
from scipy.special import expit, logit

# probabilities
y = clf.predict_proba(X_train)[:, 1]
# expected raw base value
y_raw = logit(y).mean()
# expected probability, i.e. base value in probability space
print(expit(y_raw))

which prints:

0.8875405774316522
The relevant plot for the 0th data point, in probability space:
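The corresponding call (the same one used in the full example below):

shap.force_plot(explainer.expected_value[0], shap_values[0,:], X_train.iloc[0,:], link="logit")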
Note that the probability base_value, from shap's point of view, which they call the baseline probability if no data is available, is not what a reasonable person would define in the absence of independent variables (0.6373626373626373 in this case).
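A minimal sketch of that gap, reusing the fitted clf and the expit/logit imports from above: the sigmoid of the mean raw score is not the same as the mean of the predictions, which is why shap's probability-space base value (~0.888) differs from the intuitive baseline (~0.637):

# the "intuitive" baseline: mean prediction over the training data
print(clf.predict(X_train).mean())                            # 0.6373626373626373
# shap's probability-space base value: sigmoid of the mean raw score
print(expit(logit(clf.predict_proba(X_train)[:, 1]).mean()))  # 0.8875405774316522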
Full reproducible example:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
import pandas as pd
import shap

print(shap.__version__)

X, y = load_breast_cancer(return_X_y=True)
X = pd.DataFrame(data=X)
y = pd.DataFrame(data=y)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = GradientBoostingClassifier(random_state=0)
clf.fit(X_train, y_train.values.ravel())

# load JS visualization code to notebook
shap.initjs()

explainer = shap.TreeExplainer(clf, model_output="raw")
shap_values = explainer.shap_values(X_train)

from scipy.special import expit, logit

# probabilities
y = clf.predict_proba(X_train)[:, 1]
# expected raw base value
y_raw = logit(y).mean()
# expected probability, i.e. base value in probability space
print("Expected raw score (before sigmoid):", y_raw)
print("Expected probability:", expit(y_raw))

# visualize the first prediction's explanation (use matplotlib=True to avoid Javascript)
shap.force_plot(explainer.expected_value[0], shap_values[0,:], X_train.iloc[0,:], link="logit")
Output:
0.36.0
Expected raw score (before sigmoid): 2.065861773054686
Expected probability: 0.8875405774316522