I would really appreciate feedback on how to interpret my RF model and on how to evaluate results like these in general.
```
57658 samples
   27 predictor
    2 classes: 'stayed', 'left'

No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 11531, 11531, 11532, 11532, 11532
Resampling results across tuning parameters:

  mtry  splitrule   ROC        Sens       Spec
   2    gini        0.6273579  0.9999011  0.0006250729
   2    extratrees  0.6246980  0.9999197  0.0005667791
  14    gini        0.5968382  0.9324610  0.1116113149
  14    extratrees  0.6192781  0.9740323  0.0523004026
  27    gini        0.5584677  0.7546156  0.2977507092
  27    extratrees  0.5589923  0.7635036  0.2905489827

Tuning parameter 'min.node.size' was held constant at a value of 1
ROC was used to select the optimal model using the largest value.
The final values used for the model were mtry = 2, splitrule = gini and min.node.size = 1.
```
After a few adjustments to the functional form of my Y variable and to how I split the data, I got the following results: my ROC improved slightly, but interestingly my sensitivity and specificity changed drastically compared with the initial model.
```
35000 samples
   27 predictor
    2 classes: 'stayed', 'left'

No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 7000, 7000, 7000, 7000, 7000
Resampling results across tuning parameters:

  mtry  splitrule   ROC        Sens          Spec
   2    gini        0.6351733  0.0004618204  0.9998685
   2    extratrees  0.6287926  0.0000000000  0.9999899
  14    gini        0.6032979  0.1346653886  0.9170874
  14    extratrees  0.6235212  0.0753069696  0.9631711
  27    gini        0.5725621  0.3016414054  0.7575899
  27    extratrees  0.5716616  0.2998190728  0.7636219

Tuning parameter 'min.node.size' was held constant at a value of 1
ROC was used to select the optimal model using the largest value.
The final values used for the model were mtry = 2, splitrule = gini and min.node.size = 1.
```
This time I split the data randomly rather than by time, and experimented with several mtry values using the following code:
```{r Cross Validation Part 1}
library(caret)  # createFolds() and the tuning grid below come from caret

set.seed(1992)  # setting a seed for replication purposes

# Partition the data into 5 equal folds
folds <- createFolds(train_data$left_welfare, k = 5)

# Note: in ranger, splitrule = "variance" is a regression split rule; for a
# two-class outcome only "gini" and "extratrees" apply, which is why the
# "variance" rows in the results below come out as ROC = 0.5 with NaN.
tune_mtry <- expand.grid(mtry = c(2, 10, 15, 20),
                         splitrule = c("variance", "extratrees"),
                         min.node.size = c(1, 5, 10))

sapply(folds, length)
```
and got the following results:
```
Random Forest

84172 samples
   14 predictor
    2 classes: 'stayed', 'left'

No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 16834, 16834, 16834, 16835, 16835
Resampling results across tuning parameters:

  mtry  splitrule   ROC        Sens       Spec
   2    variance    0.5000000        NaN        NaN
   2    extratrees  0.7038724  0.3714761  0.8844723
   5    variance    0.5000000        NaN        NaN
   5    extratrees  0.7042525  0.3870192  0.8727755
   8    variance    0.5000000        NaN        NaN
   8    extratrees  0.7014818  0.4075797  0.8545012
  10    variance    0.5000000        NaN        NaN
  10    extratrees  0.6956536  0.4336180  0.8310368
  12    variance    0.5000000        NaN        NaN
  12    extratrees  0.6771292  0.4701687  0.7777730
  15    variance    0.5000000        NaN        NaN
  15    extratrees  0.5000000        NaN        NaN

Tuning parameter 'min.node.size' was held constant at a value of 1
ROC was used to select the optimal model using the largest value.
The final values used for the model were mtry = 5, splitrule = extratrees and min.node.size = 1.
```
Answer:
It looks like your random forest has almost no power to predict the second class, 'left'. The best-scoring configurations all have extremely high sensitivity and extremely low specificity, which essentially means your classifier assigns everything to the 'stayed' class, which I would guess is the majority class. Unfortunately that is quite bad, because it is barely different from a trivial classifier that labels everything as the first class.
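To see why an all-majority classifier produces exactly this Sens ≈ 1, Spec ≈ 0 pattern, here is a minimal base-R sketch with made-up labels (a 95/5 class split is assumed purely for illustration; 'stayed' is treated as the positive class, as in caret's output):

```r
# Made-up ground truth: 95 'stayed' (majority) vs 5 'left' (minority)
truth <- factor(c(rep("stayed", 95), rep("left", 5)),
                levels = c("stayed", "left"))

# Trivial classifier: predict 'stayed' for everything
pred <- factor(rep("stayed", 100), levels = c("stayed", "left"))

# Sensitivity for the positive class 'stayed' and specificity for 'left'
sens <- sum(pred == "stayed" & truth == "stayed") / sum(truth == "stayed")
spec <- sum(pred == "left"   & truth == "left")   / sum(truth == "left")
c(sensitivity = sens, specificity = spec)  # 1 and 0, matching the pattern above
```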
Also, it is not clear to me whether you only tried mtry values of 2, 14 and 27; if that is the case, I would strongly encourage you to try the whole range from 3 to 25 (the optimum is most likely somewhere in between).
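In caret, trying the whole range could look like the sketch below. This is only a sketch under assumptions: it assumes the ranger engine ("ranger" method), a two-class summary for ROC-based selection, and uses the asker's `train_data` / `left_welfare` objects, which are not defined here:

```r
library(caret)  # train(), trainControl(), twoClassSummary

# Sweep mtry over the full 2..27 range instead of three isolated points;
# only the classification split rules are included.
tune_grid <- expand.grid(mtry = 2:27,
                         splitrule = c("gini", "extratrees"),
                         min.node.size = 1)

ctrl <- trainControl(method = "cv", number = 10,
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary)

# Assumed call shape; train_data and left_welfare are the asker's objects.
rf_fit <- train(left_welfare ~ ., data = train_data,
                method = "ranger", metric = "ROC",
                trControl = ctrl, tuneGrid = tune_grid)
```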
Beyond that, since performance looks rather poor (judging by the ROC), I would suggest putting more work into feature engineering to extract more information. Otherwise, if you are satisfied with the current results or believe nothing more can be extracted, simply adjust the classification probability threshold so that sensitivity and specificity match your requirements for the classes (depending on your problem, you may care more about misclassifying 'stayed' as 'left', or vice versa).
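Threshold adjustment can be sketched in base R as follows. The probabilities here are made up for illustration; in practice you would take the 'left' column of `predict(fit, newdata, type = "prob")` from your fitted caret model:

```r
# Made-up predicted probabilities for the 'left' class and matching truth
set.seed(1)
p_left <- runif(1000)
truth  <- factor(ifelse(runif(1000) < p_left, "left", "stayed"),
                 levels = c("stayed", "left"))

# Classify as 'left' whenever P(left) meets the chosen threshold
classify <- function(p, threshold) {
  factor(ifelse(p >= threshold, "left", "stayed"),
         levels = c("stayed", "left"))
}

# Lowering the threshold below the default 0.5 trades specificity for
# sensitivity on the minority 'left' class.
for (thr in c(0.5, 0.3, 0.1)) {
  pred      <- classify(p_left, thr)
  sens_left <- mean(pred[truth == "left"] == "left")
  cat(sprintf("threshold %.1f: sensitivity for 'left' = %.2f\n", thr, sens_left))
}
```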
Hope this helps!