Scala: extracting random forest feature importances together with feature names (labels)

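Is there a way to extract the feature importances from the model and attach the featureCols names to them, so the result is easier to analyze?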

I have code along these lines:

import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

val featureCols = Array("a", "b", "c" /* .......... 67 more */)

// Assemble the individual feature columns into a single vector column
val assembler = new VectorAssembler()
  .setInputCols(featureCols)
  .setOutputCol("features")
val df2 = assembler.transform(modeling_db)

// Index the target column "def" into a numeric "label" column
val labelIndexer = new StringIndexer()
  .setInputCol("def")
  .setOutputCol("label")
val df3 = labelIndexer.fit(df2).transform(df2)

val splitSeed = 5043
val Array(trainingData, testDataCE) = df3.randomSplit(Array(0.7, 0.3), splitSeed)

val classifier = new RandomForestClassifier()
  .setImpurity("gini")
  .setMaxDepth(19)
  .setNumTrees(57)
  .setFeatureSubsetStrategy("auto")
  .setSeed(5043)
val model = classifier.fit(trainingData)

After that, we tried to extract the importances with:

model.featureImportances

But the result is hard to analyze:

res14: org.apache.spark.mllib.linalg.Vector = (71,[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,20,23,25,27,33,34,35,38,39,41,42,45,47,48,49,50,51,52,53,54,55,56,57,58,60,61,62,63,64,65,66,67,68,69,70],[0.22362951804309808,0.1830148359365108,0.10246542303449771,0.1699399958851977,0.06486419413350401,0.05187244974385025,0.02627047699833213,0.014498050071723645,0.026182513062665076,0.007126662761055224,0.012375060477018274,0.004354513006816487,0.004361008357237427,0.008435852744278544,0.003195472326415685,0.0023071401643885753,0.004602370417578224,0.0030394399903992345,6.92408316823549E-4,0.011207695216651398,7.609910745572573E-4,8.316382113306638E-4,0.0021506289318167916,0.0013468620354363688,0.006968754359778437,0.018796331618729723,0.0024516591941419444,0.005980997035580654,0.0027983...

Is there a way to "unpack" this result and attach it to the original label names?


Answer:

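You already have the original column names in featureCols, and it looks like none of those columns is itself a vector. Each column therefore contributes exactly one slot to the assembled features vector, so the importance at position i belongs to featureCols(i), and you can simply zip the two arrays together. For input data like this: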

import org.apache.spark.ml.linalg.Vectors // on Spark 1.x: org.apache.spark.mllib.linalg.Vectors

val featureCols = Array("a", "b", "c", "d", "e")
val featureImportance = Vectors.dense(Array(0.15, 0.25, 0.1, 0.35, 0.15)).toSparse

just run

val res = featureCols.zip(featureImportance.toArray).sortBy(-_._2)

Printing the result gives

(d,0.35)
(b,0.25)
(a,0.15)
(e,0.15)
(c,0.1)
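To make the "unpacking" step more concrete, here is a minimal plain-Scala sketch (no Spark dependency) of what happens when a sparse vector printed as (size, [indices], [values]) is expanded and paired with names. The object FeatureImportanceSketch and its methods toDense and labelImportances are hypothetical helpers written for illustration; they are not part of Spark. The length check is an added assumption about what you would want, since zip silently truncates to the shorter of the two arrays.

```scala
// Hypothetical helpers for illustration only; not part of the Spark API.
object FeatureImportanceSketch {

  // Expand a sparse vector given as (size, indices, values) into a dense array.
  // Slots that are not listed in `indices` keep an importance of 0.0.
  def toDense(size: Int, indices: Array[Int], values: Array[Double]): Array[Double] = {
    require(indices.length == values.length, "indices and values must have the same length")
    val dense = Array.fill(size)(0.0)
    indices.zip(values).foreach { case (i, v) => dense(i) = v }
    dense
  }

  // Pair each name with its importance and sort by importance, highest first.
  // Fails fast on a length mismatch instead of letting zip drop entries.
  def labelImportances(names: Array[String], importances: Array[Double]): Array[(String, Double)] = {
    require(
      names.length == importances.length,
      s"got ${names.length} names but ${importances.length} importances"
    )
    names.zip(importances).sortBy(-_._2)
  }
}

// Example with made-up numbers: features "c" and "e" were never used by the model.
val dense = FeatureImportanceSketch.toDense(5, Array(0, 1, 3), Array(0.2, 0.5, 0.3))
val ranked = FeatureImportanceSketch.labelImportances(Array("a", "b", "c", "d", "e"), dense)
ranked.foreach(println)
```

With the model from the question, calling toArray on the vector already performs the dense expansion, so the equivalent call would be featureCols.zip(model.featureImportances.toArray). Comparing featureCols.length with the size of the importance vector first is a cheap way to confirm that the names and the slots really line up.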
