I would really appreciate feedback on how to interpret my RF model and on how to evaluate results like these in general.
```
57658 samples
   27 predictor
    2 classes: 'stayed', 'left'

No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 11531, 11531, 11532, 11532, 11532
Resampling results across tuning parameters:

  mtry  splitrule   ROC        Sens       Spec
   2    gini        0.6273579  0.9999011  0.0006250729
   2    extratrees  0.6246980  0.9999197  0.0005667791
  14    gini        0.5968382  0.9324610  0.1116113149
  14    extratrees  0.6192781  0.9740323  0.0523004026
  27    gini        0.5584677  0.7546156  0.2977507092
  27    extratrees  0.5589923  0.7635036  0.2905489827

Tuning parameter 'min.node.size' was held constant at a value of 1
ROC was used to select the optimal model using the largest value.
The final values used for the model were mtry = 2, splitrule = gini and min.node.size = 1.
```
After a few adjustments to the functional form of my Y variable and to how I split the data, I got the following results: my ROC improved slightly, but interestingly my sensitivity and specificity changed drastically compared with the initial model.
```
35000 samples
   27 predictor
    2 classes: 'stayed', 'left'

No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 7000, 7000, 7000, 7000, 7000
Resampling results across tuning parameters:

  mtry  splitrule   ROC        Sens          Spec
   2    gini        0.6351733  0.0004618204  0.9998685
   2    extratrees  0.6287926  0.0000000000  0.9999899
  14    gini        0.6032979  0.1346653886  0.9170874
  14    extratrees  0.6235212  0.0753069696  0.9631711
  27    gini        0.5725621  0.3016414054  0.7575899
  27    extratrees  0.5716616  0.2998190728  0.7636219

Tuning parameter 'min.node.size' was held constant at a value of 1
ROC was used to select the optimal model using the largest value.
The final values used for the model were mtry = 2, splitrule = gini and min.node.size = 1.
```
This time I split the data randomly rather than by time, and experimented with several mtry values using the following code:
```{r Cross Validation Part 1}
library(caret)  # createFolds() and the tuning grid below come from caret

set.seed(1992)  # setting a seed for replication purposes

# Partition the data into 5 equal folds
folds <- createFolds(train_data$left_welfare, k = 5)

# Note: in ranger, splitrule = "variance" is a regression split rule; for a
# two-class outcome only "gini" and "extratrees" apply, which is why the
# "variance" rows in the results below come out as ROC = 0.5 with NaN.
tune_mtry <- expand.grid(mtry = c(2, 10, 15, 20),
                         splitrule = c("variance", "extratrees"),
                         min.node.size = c(1, 5, 10))

sapply(folds, length)
```
and got the following results:
```
Random Forest

84172 samples
   14 predictor
    2 classes: 'stayed', 'left'

No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 16834, 16834, 16834, 16835, 16835
Resampling results across tuning parameters:

  mtry  splitrule   ROC        Sens       Spec
   2    variance    0.5000000        NaN        NaN
   2    extratrees  0.7038724  0.3714761  0.8844723
   5    variance    0.5000000        NaN        NaN
   5    extratrees  0.7042525  0.3870192  0.8727755
   8    variance    0.5000000        NaN        NaN
   8    extratrees  0.7014818  0.4075797  0.8545012
  10    variance    0.5000000        NaN        NaN
  10    extratrees  0.6956536  0.4336180  0.8310368
  12    variance    0.5000000        NaN        NaN
  12    extratrees  0.6771292  0.4701687  0.7777730
  15    variance    0.5000000        NaN        NaN
  15    extratrees  0.5000000        NaN        NaN

Tuning parameter 'min.node.size' was held constant at a value of 1
ROC was used to select the optimal model using the largest value.
The final values used for the model were mtry = 5, splitrule = extratrees and min.node.size = 1.
```
Answer:
It looks like your random forest has almost no power to predict the second class, 'left'. The best-scoring configurations all have extremely high sensitivity and extremely low specificity, which essentially means your classifier assigns everything to the 'stayed' class, which I would guess is the majority class. Unfortunately that is quite bad, because it is barely different from a trivial classifier that labels everything as the first class.
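To see why an all-majority classifier produces exactly this Sens ≈ 1, Spec ≈ 0 pattern, here is a minimal base-R sketch with made-up labels (a 95/5 class split is assumed purely for illustration; 'stayed' is treated as the positive class, as in caret's output):

```r
# Made-up ground truth: 95 'stayed' (majority) vs 5 'left' (minority)
truth <- factor(c(rep("stayed", 95), rep("left", 5)),
                levels = c("stayed", "left"))

# Trivial classifier: predict 'stayed' for everything
pred <- factor(rep("stayed", 100), levels = c("stayed", "left"))

# Sensitivity for the positive class 'stayed' and specificity for 'left'
sens <- sum(pred == "stayed" & truth == "stayed") / sum(truth == "stayed")
spec <- sum(pred == "left"   & truth == "left")   / sum(truth == "left")
c(sensitivity = sens, specificity = spec)  # 1 and 0, matching the pattern above
```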
Also, it is not clear to me whether you only tried mtry values of 2, 14 and 27; if that is the case, I would strongly encourage you to try the whole range from 3 to 25 (the optimum is most likely somewhere in between).
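In caret, trying the whole range could look like the sketch below. This is only a sketch under assumptions: it assumes the ranger engine ("ranger" method), a two-class summary for ROC-based selection, and uses the asker's `train_data` / `left_welfare` objects, which are not defined here:

```r
library(caret)  # train(), trainControl(), twoClassSummary

# Sweep mtry over the full 2..27 range instead of three isolated points;
# only the classification split rules are included.
tune_grid <- expand.grid(mtry = 2:27,
                         splitrule = c("gini", "extratrees"),
                         min.node.size = 1)

ctrl <- trainControl(method = "cv", number = 10,
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary)

# Assumed call shape; train_data and left_welfare are the asker's objects.
rf_fit <- train(left_welfare ~ ., data = train_data,
                method = "ranger", metric = "ROC",
                trControl = ctrl, tuneGrid = tune_grid)
```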
Beyond that, since performance looks rather poor (judging by the ROC), I would suggest putting more work into feature engineering to extract more information. Otherwise, if you are satisfied with the current results or believe nothing more can be extracted, simply adjust the classification probability threshold so that sensitivity and specificity match your requirements for the classes (depending on your problem, you may care more about misclassifying 'stayed' as 'left', or vice versa).
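Threshold adjustment can be sketched in base R as follows. The probabilities here are made up for illustration; in practice you would take the 'left' column of `predict(fit, newdata, type = "prob")` from your fitted caret model:

```r
# Made-up predicted probabilities for the 'left' class and matching truth
set.seed(1)
p_left <- runif(1000)
truth  <- factor(ifelse(runif(1000) < p_left, "left", "stayed"),
                 levels = c("stayed", "left"))

# Classify as 'left' whenever P(left) meets the chosen threshold
classify <- function(p, threshold) {
  factor(ifelse(p >= threshold, "left", "stayed"),
         levels = c("stayed", "left"))
}

# Lowering the threshold below the default 0.5 trades specificity for
# sensitivity on the minority 'left' class.
for (thr in c(0.5, 0.3, 0.1)) {
  pred      <- classify(p_left, thr)
  sens_left <- mean(pred[truth == "left"] == "left")
  cat(sprintf("threshold %.1f: sensitivity for 'left' = %.2f\n", thr, sens_left))
}
```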
Hope this helps!