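I am using the mlr package in R to compare two learners, a random forest and a Lasso classifier, on a binary classification task. I use nested cross-validation to estimate performance. I then want to compute feature importance for the best classifier (the random forest in this case). For that I used generateFeatureImportanceData(), which is described as follows: "Estimate how important individual features or groups of features are by contrasting prediction performances. For method 'permutation.importance' compute the change in performance from permuting the values of a feature (or a group of features) and compare that to the predictions made on the unpermuted data." Since I specified measure = auc, does res in the output give, for each feature, the decrease in auc after its values are permuted?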
    library(easypackages)
    libraries("mlr", "purrr", "glmnet", "parallelMap", "parallel")

    data = read.table("data_past.txt", header = TRUE)
    set.seed(123)
    task = makeClassifTask(id = "past_history", data = data, target = "DIAG", positive = "BD")

    # Specify the hyperparameters for the random forest
    ps_rf = makeParamSet(
      makeIntegerParam("mtry", lower = 4, upper = 16),
      makeDiscreteParam("ntree", values = 1000)
    )
    ctrl_rf = makeTuneControlRandom(maxit = 10L)
    inner = makeResampleDesc("RepCV", folds = 10, reps = 3, stratify = TRUE)

    # Wrap the learner so that mtry is tuned on the inner resampling
    lrn_rf = makeLearner("classif.randomForest", predict.type = "prob", fix.factors.prediction = TRUE)
    lrn_rf = makeTuneWrapper(lrn_rf, resampling = inner, par.set = ps_rf, control = ctrl_rf,
                             measures = auc, show.info = FALSE)

    # Permutation importance, measured as the change in AUC
    parallelStartMulticore(36)
    ft_im = generateFeatureImportanceData(task = task, method = "permutation.importance",
                                          learner = lrn_rf, measure = auc)
    parallelStop()

    t(ft_im$res)

                                    auc
    INC2_A                 0.000000e+00
    INC2_B                 0.000000e+00
    INC2_F                 0.000000e+00
    INC2_G                 0.000000e+00
    INC2_H                 0.000000e+00
    INC2_I                 0.000000e+00
    SEX                    0.000000e+00
    marital               -3.211696e-07
    inpatient              0.000000e+00
    CMS_1                  0.000000e+00
    CMS_2                  0.000000e+00
    CMS_3                  0.000000e+00
    CMS_4                  0.000000e+00
    CMS_5                  0.000000e+00
    CMS_6                  0.000000e+00
    CMS_7                  0.000000e+00
    CMS_8                  0.000000e+00
    CMS_9                  0.000000e+00
    CMS_10                 0.000000e+00
    CMS_11                 0.000000e+00
    CMS_12                 0.000000e+00
    CMS_13                 0.000000e+00
    CMS_14                 0.000000e+00
    OCS_1                  0.000000e+00
    OCS_2                  0.000000e+00
    OCS_3                  0.000000e+00
    OCS_4                  0.000000e+00
    OCS_5                  0.000000e+00
    OCS_6                  0.000000e+00
    OCS_7                  0.000000e+00
    OCS_8                  0.000000e+00
    OCS_9                  0.000000e+00
    OCS_10                 0.000000e+00
    OCS_11                 0.000000e+00
    reta                   0.000000e+00
    MH_F1                 -1.051220e-03
    CP_1BA                 0.000000e+00
    CP_1BS                 0.000000e+00
    MIXCLINICAL3           0.000000e+00
    MIXCLINICAL2           0.000000e+00
    MIXDS52Simpt           0.000000e+00
    MIXDS53Simpt           0.000000e+00
    PAN                    0.000000e+00
    OBS                    0.000000e+00
    PHO                    0.000000e+00
    GAD                    0.000000e+00
    EAT_0                  0.000000e+00
    ADHD                   0.000000e+00
    BORDERLINEPERSONALITY  0.000000e+00
    AlcoolProbUse          0.000000e+00
    SubstanceProbUse       0.000000e+00
    BMI                   -2.954760e-06
    DEP_AGE               -7.996641e-04
    NBD_P                 -1.669455e-03
    NBDEP                 -8.671578e-06
    NBSUI                 -2.055485e-06
    NBHOS                 -8.091225e-03
    DURDEP                -1.380869e-04
    SEV_M                 -3.083132e-03
    SEV_D                  0.000000e+00
    CMS_sum                0.000000e+00
    TOTMIXDSM5             0.000000e+00
    GAF                   -1.170663e-05
    Age                   -1.172269e-06
    Comorbidities_sum      0.000000e+00
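Is the feature with the highest absolute value the most important one? Does an auc value of zero mean that the feature is irrelevant for the classification task at hand? Thank you.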
Answer:
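A feature's score is the difference between the model's performance with that feature's values permuted and its normal performance on the unpermuted data. Judging by the negative values in the output above, the sign is permuted minus normal, so a negative value means the AUC dropped when the feature was shuffled.
A feature with an AUC change of 0 is therefore irrelevant in the sense that it brings no added value to this model: it matters as much as pure random noise would. Conversely, the features with the largest absolute values are the most important ones, because permuting them changes the score the most.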
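To make the mechanics concrete, here is a minimal base-R sketch of permutation importance. It is not mlr's implementation: the simulated data, the logistic regression model, and the helper names auc_score and perm_importance are all assumptions made for illustration, and the sign convention (permuted minus original) is chosen to match the negative values in the output above.

```r
# Minimal base-R sketch of permutation importance (illustrative, not mlr's code).
set.seed(123)

# Simulated data (hypothetical): x1 is a strong predictor, x2 a weak one,
# and noise is unrelated to the outcome and left out of the model.
n     <- 500
x1    <- rnorm(n)
x2    <- rnorm(n)
noise <- rnorm(n)
y     <- rbinom(n, 1, plogis(2 * x1 + 0.5 * x2))
dat   <- data.frame(y, x1, x2, noise)

# AUC via the rank-sum (Mann-Whitney) identity.
auc_score <- function(truth, prob) {
  r  <- rank(prob)
  n1 <- sum(truth == 1)
  n0 <- sum(truth == 0)
  (sum(r[truth == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Fit once; the model is NOT refit when a feature is permuted.
fit      <- glm(y ~ x1 + x2, data = dat, family = binomial)
base_auc <- auc_score(dat$y, predict(fit, dat, type = "response"))

# Mean of (permuted AUC - original AUC) over n_perm shuffles of one column.
perm_importance <- function(feature, n_perm = 50) {
  diffs <- replicate(n_perm, {
    shuffled <- dat
    shuffled[[feature]] <- sample(shuffled[[feature]])
    auc_score(dat$y, predict(fit, shuffled, type = "response")) - base_auc
  })
  mean(diffs)
}

importance <- sapply(c("x1", "x2", "noise"), perm_importance)
print(importance)
```

Because the model never uses the noise column, permuting it leaves every prediction unchanged and its score is exactly zero, whereas x1, the stronger predictor, should show the largest drop in AUC. That is the same reading as for the output above: zero means the fitted model's predictions do not depend on the feature, and a large negative value means they depend on it heavily.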