Printing the decision path of a specific sample in a random forest classifier

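How can I print the decision path that a specific sample follows through a random forest, rather than printing the paths of the individual trees in the forest?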

import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000,
                           n_features=6,
                           n_informative=3,
                           n_classes=2,
                           random_state=0,
                           shuffle=False)

# Build a DataFrame
df = pd.DataFrame({'Feature 1': X[:, 0],
                   'Feature 2': X[:, 1],
                   'Feature 3': X[:, 2],
                   'Feature 4': X[:, 3],
                   'Feature 5': X[:, 4],
                   'Feature 6': X[:, 5],
                   'Class': y})

y_train = df['Class']
X_train = df.drop('Class', axis=1)

rf = RandomForestClassifier(n_estimators=10,
                            random_state=0)
rf.fit(X_train, y_train)

The decision_path method for random forests was introduced in v0.18. (http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)

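However, it returns a sparse matrix that I am not sure how to interpret. Can anyone suggest the best way to print the decision path for that particular sample, and then visualize it?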

# Extract the decision path for instance i = 12
i_data = X_train.iloc[12].values.reshape(1, -1)
d_path = rf.decision_path(i_data)
print(d_path)

Output:

(<1x1432 sparse matrix of type '<class 'numpy.int64'>' with 96 stored elements in Compressed Sparse Row format>, array([   0,  133,  282,  415,  588,  761,  910, 1041, 1182, 1309, 1432], dtype=int32))
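A note on reading this output: on a forest, `decision_path` returns a pair. The first item is an indicator matrix with one column per node across all trees. The second is an array of column offsets, so tree `k` owns columns `n_nodes_ptr[k]` to `n_nodes_ptr[k + 1]`. The sketch below splits the matrix back into per-tree node ids. It is only an illustration: `forest_decision_path` is a hypothetical helper name, the model is fitted on the raw array rather than the DataFrame, and sorting the ids assumes that scikit-learn numbers children after their parents.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def forest_decision_path(forest, sample):
    """Return, for each tree, the local node ids visited by one sample.

    Hypothetical helper for illustration; not a scikit-learn API.
    """
    sample = np.asarray(sample).reshape(1, -1)
    # indicator: CSR matrix of shape (1, total number of nodes in the forest)
    # n_nodes_ptr: column offsets delimiting each tree's block of columns
    indicator, n_nodes_ptr = forest.decision_path(sample)
    visited = indicator.indices  # global column ids of the visited nodes
    paths = []
    for k in range(len(forest.estimators_)):
        start, stop = n_nodes_ptr[k], n_nodes_ptr[k + 1]
        in_tree = visited[(visited >= start) & (visited < stop)]
        # Subtract the offset to get node ids local to tree k
        paths.append(np.sort(in_tree - start))
    return paths


# Same synthetic data as in the question (assumption: fitted on the array X)
X, y = make_classification(n_samples=1000, n_features=6, n_informative=3,
                           n_classes=2, random_state=0, shuffle=False)
rf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

paths = forest_decision_path(rf, X[12])
for k, path in enumerate(paths):
    print("tree %d: %s" % (k, " -> ".join(str(n) for n in path)))
```

Each printed line starts at node 0 (the root of that tree) and ends at the leaf that `rf.apply` reports for the same tree.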

Answer:

I found this code in the scikit-learn documentation and adapted it to your problem.

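Since a RandomForestClassifier is an ensemble of DecisionTreeClassifier instances, we can iterate over the individual trees and retrieve the sample's decision path in each one. Hope this helps: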

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000,
                           n_features=6,
                           n_informative=3,
                           n_classes=2,
                           random_state=0,
                           shuffle=False)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

estimator = RandomForestClassifier(n_estimators=10,
                                   random_state=0)
estimator.fit(X_train, y_train)

# Each fitted decision tree has an attribute called tree_ which stores the
# entire tree structure and allows access to low-level attributes.
# The binary tree tree_ is represented as a number of parallel arrays. The
# i-th element of each array holds information about node `i`.
# Node 0 is the root of the tree. NOTE:
# some of the arrays only apply to either leaves or split nodes; the values
# for nodes of the other type are arbitrary!
#
# Among those arrays, we have:
#   - children_left, id of the left child of the node
#   - children_right, id of the right child of the node
#   - feature, feature used for splitting the node
#   - threshold, threshold value at the node
#
# Using those arrays, we can parse the structure of every tree in the forest:
n_nodes_ = [t.tree_.node_count for t in estimator.estimators_]
children_left_ = [t.tree_.children_left for t in estimator.estimators_]
children_right_ = [t.tree_.children_right for t in estimator.estimators_]
feature_ = [t.tree_.feature for t in estimator.estimators_]
threshold_ = [t.tree_.threshold for t in estimator.estimators_]


def explore_tree(estimator, n_nodes, children_left, children_right, feature,
                 threshold, suffix='', print_tree=False, sample_id=0,
                 feature_names=None):
    # Here `estimator` is a single fitted tree, not the whole forest.
    if feature_names is None:
        feature_names = ["X[:, %d]" % i for i in range(X.shape[1])]
    assert len(feature_names) == X.shape[1], \
        "The feature names do not match the number of features."

    # The tree structure can be traversed to compute various properties such
    # as the depth of each node and whether or not it is a leaf.
    node_depth = np.zeros(shape=n_nodes, dtype=np.int64)
    is_leaves = np.zeros(shape=n_nodes, dtype=bool)
    stack = [(0, -1)]  # seed is the root node id and its parent depth
    while len(stack) > 0:
        node_id, parent_depth = stack.pop()
        node_depth[node_id] = parent_depth + 1

        # If we have a test node
        if children_left[node_id] != children_right[node_id]:
            stack.append((children_left[node_id], parent_depth + 1))
            stack.append((children_right[node_id], parent_depth + 1))
        else:
            is_leaves[node_id] = True

    print("The binary tree structure has %s nodes" % n_nodes)
    if print_tree:
        print("Tree structure:\n")
        for i in range(n_nodes):
            if is_leaves[i]:
                print("%snode=%s leaf node." % (node_depth[i] * "\t", i))
            else:
                print("%snode=%s test node: go to node %s if X[:, %s] <= %s "
                      "else to node %s."
                      % (node_depth[i] * "\t",
                         i,
                         children_left[i],
                         feature[i],
                         threshold[i],
                         children_right[i]))
        print()

    # First let's retrieve the decision path of each sample. The decision_path
    # method returns the node indicator matrix. A non-zero element at
    # position (i, j) indicates that sample i goes through node j.
    node_indicator = estimator.decision_path(X_test)

    # Similarly, we can also get the id of the leaf reached by each sample.
    leave_id = estimator.apply(X_test)

    # Now it is possible to get the tests that were used to predict a sample
    # or a group of samples. First, let's do it for one sample.
    node_index = node_indicator.indices[node_indicator.indptr[sample_id]:
                                        node_indicator.indptr[sample_id + 1]]

    print(X_test[sample_id, :])
    print('Rules used to predict sample %s: ' % sample_id)
    # Use " " * node_depth[node_id] inside the loop to indent by tree level.
    tabulation = ""
    for node_id in node_index:
        if leave_id[sample_id] == node_id:
            print("%s==> Predicted leaf index \n" % tabulation)
            # A leaf has no split, so there is no test to print for it.
            continue

        if X_test[sample_id, feature[node_id]] <= threshold[node_id]:
            threshold_sign = "<="
        else:
            threshold_sign = ">"

        print("%sdecision id node %s : (X_test[%s, '%s'] (= %s) %s %s)"
              % (tabulation,
                 node_id,
                 sample_id,
                 feature_names[feature[node_id]],
                 X_test[sample_id, feature[node_id]],
                 threshold_sign,
                 threshold[node_id]))
    print("%sPrediction for sample %d: %s"
          % (tabulation, sample_id, estimator.predict(X_test)[sample_id]))

    # For a group of samples, we have the following common nodes.
    sample_ids = [sample_id, 1]
    common_nodes = (node_indicator.toarray()[sample_ids].sum(axis=0) ==
                    len(sample_ids))
    common_node_id = np.arange(n_nodes)[common_nodes]

    print("\nThe following samples %s share the node %s in the tree"
          % (sample_ids, common_node_id))
    print("It is %s %% of all nodes." % (100 * len(common_node_id) / n_nodes,))

    for sample_id_ in sample_ids:
        print("Prediction for sample %d: %s"
              % (sample_id_, estimator.predict(X_test)[sample_id_]))


# Iterate over the trees of the forest and print the path of one test sample
# in each of them.
for i, tree in enumerate(estimator.estimators_):
    print("Tree %d\n" % i)
    explore_tree(tree, n_nodes_[i], children_left_[i], children_right_[i],
                 feature_[i], threshold_[i], suffix=i, sample_id=1,
                 feature_names=["Feature_%d" % j for j in range(X.shape[1])])
    print('\n' * 2)
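Since the question also asks about visualizing the path, here is a more compact variant that returns the rules as strings and indents them by depth as a lightweight text rendering. It is a sketch under assumptions, not the scikit-learn example itself: `path_rules` is a hypothetical helper, the feature names are made up to mirror the question's columns, and the forest is fitted on the full array so that row 12 matches the sample from the question.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def path_rules(tree, row, feature_names):
    """Describe the root-to-leaf path of one sample in a single fitted tree.

    Hypothetical helper for illustration; not part of scikit-learn.
    """
    # Trees compare float32 inputs against thresholds, so cast the same way
    row = np.asarray(row, dtype=np.float32).reshape(1, -1)
    t = tree.tree_
    node_ids = np.sort(tree.decision_path(row).indices)
    rules = []
    for node_id in node_ids:
        if t.children_left[node_id] == t.children_right[node_id]:
            # Leaf: both child pointers are -1, there is no test to report
            rules.append("leaf %d" % node_id)
            continue
        feat = t.feature[node_id]
        value = row[0, feat]
        sign = "<=" if value <= t.threshold[node_id] else ">"
        rules.append("node %d: %s (= %.3f) %s %.3f"
                     % (node_id, feature_names[feat], value, sign,
                        t.threshold[node_id]))
    return rules


X, y = make_classification(n_samples=1000, n_features=6, n_informative=3,
                           n_classes=2, random_state=0, shuffle=False)
names = ["Feature %d" % (i + 1) for i in range(X.shape[1])]  # assumed names
rf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

row = X[12]
all_rules = [path_rules(tree, row, names) for tree in rf.estimators_]
for k, rules in enumerate(all_rules):
    print("Tree %d" % k)
    for depth, rule in enumerate(rules):
        print("  " * depth + rule)
```

For an actual plot of a whole tree, `sklearn.tree.plot_tree` can draw any single estimator from `rf.estimators_`, although it shows every branch rather than only the path of one sample.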
