This is how I get the SHAP values from a model trained on a single fold:
    import shap  # package used to calculate SHAP values

    clf.fit(X_train, y_train,
            eval_set=[(X_train, y_train), (X_test, y_test)],
            eval_metric='auc', verbose=100, early_stopping_rounds=200)

    # Create an object that can calculate SHAP values
    explainer = shap.TreeExplainer(clf)
    # Calculate SHAP values
    shap_values = explainer.shap_values(X_test)
    shap.summary_plot(shap_values, X_test)
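As you know, the results can differ from fold to fold. How do I average these shap_values?

Answer:

We have the following rule:

"It is fine to average the SHAP values from models with the same output trained on the same input features, just make sure to also average the expected_value from each explainer. However, if you have non-overlapping test sets, you can't average the SHAP values from the test sets, since they are for different samples. You could use each of your models to explain the SHAP values for the whole dataset and then average those into a single matrix. (It's fine to explain examples in your training set, just remember you may be overfit to them.)"

So we need a hold-out dataset to follow this rule: every fold model has to explain the same rows. I did something like the following to make everything work as expected: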
    import numpy as np
    import lightgbm as lgb
    import shap
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import StratifiedKFold, train_test_split

    # Hold out a test set that every fold model will explain.
    X_train, X_test, y_train, y_test = train_test_split(
        df[feat], df['target'].values, test_size=0.2, shuffle=True,
        stratify=df['target'].values, random_state=42)
    folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

    shap_values = None
    auc_scores = []
    oof_preds = np.zeros(X_train.shape[0])
    for n_fold, (train_idx, valid_idx) in enumerate(folds.split(X_train, y_train)):
        # The fold indices are positions within X_train / y_train, not within df.
        train_x, train_y = X_train.iloc[train_idx], y_train[train_idx]
        valid_x, valid_y = X_train.iloc[valid_idx], y_train[valid_idx]
        clf = lgb.LGBMClassifier(
            nthread=4, boosting_type='gbdt', is_unbalance=True, random_state=42,
            learning_rate=0.05, max_depth=3, reg_lambda=0.1, reg_alpha=0.01,
            min_child_samples=21, subsample_for_bin=5000, metric='auc',
            n_estimators=5000)
        # LightGBM 4.0+ no longer accepts verbose / early_stopping_rounds in fit();
        # there, pass callbacks=[lgb.early_stopping(100)] instead.
        clf.fit(train_x, train_y,
                eval_set=[(train_x, train_y), (valid_x, valid_y)],
                eval_metric='auc', verbose=False, early_stopping_rounds=100)
        explainer = shap.TreeExplainer(clf)
        fold_shap = explainer.shap_values(X_test)
        if isinstance(fold_shap, list):
            # Some shap versions return one array per class; keep the positive class.
            fold_shap = fold_shap[1]
        shap_values = fold_shap if shap_values is None else shap_values + fold_shap
        oof_preds[valid_idx] = clf.predict_proba(valid_x)[:, 1]
        auc_scores.append(roc_auc_score(valid_y, oof_preds[valid_idx]))

    print('AUC: ', np.mean(auc_scores))
    shap_values /= folds.n_splits  # average over the folds
    shap.summary_plot(shap_values, X_test)
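
The rule quoted above also asks for the expected_value of each explainer to be averaged, which the loop does not do. Below is a minimal sketch of that step. It is not tied to LightGBM: the helper name `average_explanations` is hypothetical, and the "fold models" are toy linear models, chosen only because their SHAP values have a simple closed form when features are treated as independent (weight times the feature's deviation from its background mean). The point it illustrates is that averaged SHAP values plus the averaged base value still add up to the averaged prediction, as long as every model explains the same rows.

```python
import numpy as np


def average_explanations(fold_shap_values, fold_expected_values):
    """Average per-fold SHAP matrices and base values computed on the SAME rows."""
    stacked = np.stack(fold_shap_values)  # (n_folds, n_samples, n_features)
    mean_shap = stacked.mean(axis=0)      # (n_samples, n_features)
    mean_expected = float(np.mean(fold_expected_values))
    return mean_shap, mean_expected


# Synthetic data: a background set for the base value and a shared hold-out set.
rng = np.random.default_rng(0)
X_background = rng.normal(size=(200, 3))
X_holdout = rng.normal(size=(5, 3))
mu = X_background.mean(axis=0)

fold_shap, fold_expected, fold_preds = [], [], []
for _ in range(4):  # four toy "fold models" with different weights
    w = rng.normal(size=3)
    b = rng.normal()
    # Closed-form SHAP values of a linear model under feature independence.
    fold_shap.append(w * (X_holdout - mu))
    fold_expected.append(float(mu @ w + b))
    fold_preds.append(X_holdout @ w + b)

mean_shap, mean_expected = average_explanations(fold_shap, fold_expected)
mean_pred = np.mean(fold_preds, axis=0)

# Additivity survives averaging: base value + sum of SHAP values = mean prediction.
print(np.allclose(mean_expected + mean_shap.sum(axis=1), mean_pred))  # prints True
```

With real fold models, the two lists would instead be filled from each explainer's `shap_values(X_test)` and `expected_value` inside the cross-validation loop. Depending on the shap version, `expected_value` may itself hold one entry per class, in which case the entry matching the explained class would be the one to collect.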