Interpreting Random Forest Model Results

I would really appreciate feedback on my interpretation of my RF model, and on how to evaluate the results more generally.

```
57658 samples
   27 predictor
    2 classes: 'stayed', 'left'

No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 11531, 11531, 11532, 11532, 11532
Resampling results across tuning parameters:

  mtry  splitrule   ROC        Sens       Spec
   2    gini        0.6273579  0.9999011  0.0006250729
   2    extratrees  0.6246980  0.9999197  0.0005667791
  14    gini        0.5968382  0.9324610  0.1116113149
  14    extratrees  0.6192781  0.9740323  0.0523004026
  27    gini        0.5584677  0.7546156  0.2977507092
  27    extratrees  0.5589923  0.7635036  0.2905489827

Tuning parameter 'min.node.size' was held constant at a value of 1
ROC was used to select the optimal model using the largest value.
The final values used for the model were mtry = 2, splitrule = gini and min.node.size = 1.
```

After a few adjustments to the functional form of my Y variable and to how I split the data, I got the following results: my ROC improved slightly, but interestingly my sensitivity and specificity changed dramatically compared with the initial model.

```
35000 samples
   27 predictor
    2 classes: 'stayed', 'left'

No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 7000, 7000, 7000, 7000, 7000
Resampling results across tuning parameters:

  mtry  splitrule   ROC        Sens          Spec
   2    gini        0.6351733  0.0004618204  0.9998685
   2    extratrees  0.6287926  0.0000000000  0.9999899
  14    gini        0.6032979  0.1346653886  0.9170874
  14    extratrees  0.6235212  0.0753069696  0.9631711
  27    gini        0.5725621  0.3016414054  0.7575899
  27    extratrees  0.5716616  0.2998190728  0.7636219

Tuning parameter 'min.node.size' was held constant at a value of 1
ROC was used to select the optimal model using the largest value.
The final values used for the model were mtry = 2, splitrule = gini and min.node.size = 1.
```

This time I split the data randomly rather than by time, and experimented with several mtry values using the following code:

```{r Cross Validation Part 1}
library(caret)  # for createFolds()

set.seed(1992)  # setting a seed for replication purposes

# Partition the data into 5 equal folds
folds <- createFolds(train_data$left_welfare, k = 5)

# Tuning grid for mtry, splitrule, and min.node.size
# (note: "variance" is ranger's regression splitrule; for
# classification the valid choices include "gini" / "extratrees")
tune_mtry <- expand.grid(mtry = c(2, 10, 15, 20),
                         splitrule = c("variance", "extratrees"),
                         min.node.size = c(1, 5, 10))

sapply(folds, length)  # check the fold sizes
```
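For reference, a minimal sketch of how that grid and those folds would typically feed into caret's `train()`. The `ranger` backend is an assumption inferred from the `splitrule`/`min.node.size` parameters in the output, and `rf_fit` is a hypothetical name:

```{r Cross Validation Part 2}
library(caret)

# trainControl(index = ...) expects the *training* rows of each resample,
# while createFolds() returns the held-out rows, so invert them here
ctrl <- trainControl(method = "cv",
                     index = lapply(folds, function(i)
                       setdiff(seq_len(nrow(train_data)), i)),
                     classProbs = TRUE,  # needed to compute ROC
                     summaryFunction = twoClassSummary)

rf_fit <- train(left_welfare ~ ., data = train_data,
                method = "ranger",  # assumed backend
                metric = "ROC",
                trControl = ctrl,
                tuneGrid = tune_mtry)
```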

and got the following results:

```
Random Forest

84172 samples
   14 predictor
    2 classes: 'stayed', 'left'

No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 16834, 16834, 16834, 16835, 16835
Resampling results across tuning parameters:

  mtry  splitrule   ROC        Sens       Spec
   2    variance    0.5000000        NaN        NaN
   2    extratrees  0.7038724  0.3714761  0.8844723
   5    variance    0.5000000        NaN        NaN
   5    extratrees  0.7042525  0.3870192  0.8727755
   8    variance    0.5000000        NaN        NaN
   8    extratrees  0.7014818  0.4075797  0.8545012
  10    variance    0.5000000        NaN        NaN
  10    extratrees  0.6956536  0.4336180  0.8310368
  12    variance    0.5000000        NaN        NaN
  12    extratrees  0.6771292  0.4701687  0.7777730
  15    variance    0.5000000        NaN        NaN
  15    extratrees  0.5000000        NaN        NaN

Tuning parameter 'min.node.size' was held constant at a value of 1
ROC was used to select the optimal model using the largest value.
The final values used for the model were mtry = 5, splitrule = extratrees and min.node.size = 1.
```

Answer:

It looks like your random forest has almost no ability to predict the second class, "left". The best-scoring models all have extremely high sensitivity and extremely low specificity, which essentially means your classifier assigns everything to the "stayed" class, which I assume is the majority class. Unfortunately that is quite bad, since it is barely better than a trivial classifier that labels everything as the first class.
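If you want to see how close you are to that trivial baseline, caret's `confusionMatrix()` reports the no-information rate next to the accuracy. A minimal sketch, where `rf_fit` and `test_data` are hypothetical stand-ins for your fitted model and a held-out set:

```r
library(caret)

# rf_fit and test_data are hypothetical stand-ins for your objects
preds <- predict(rf_fit, newdata = test_data)
cm <- confusionMatrix(preds, test_data$left_welfare, positive = "left")

cm$overall["Accuracy"]      # model accuracy
cm$overall["AccuracyNull"]  # accuracy of always predicting the majority class
cm$byClass[c("Sensitivity", "Specificity")]
```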
Also, it is not clear to me whether you only tried mtry values of 2, 14 and 27; if that is the case, I would strongly suggest trying the whole range from 3 to 25 (the optimum is likely somewhere in the middle). See the sketch below.
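As a concrete sketch, sweeping that range while holding the other parameters at the values from your run would look like:

```r
# Sweep mtry over the suggested 3..25 range, holding splitrule and
# min.node.size at the values used in the original run
tune_mtry_full <- expand.grid(mtry = 3:25,
                              splitrule = "gini",
                              min.node.size = 1)
```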

Apart from that, since performance looks rather poor (judging by the ROC), I would suggest putting more work into feature engineering to extract more information from your data. Otherwise, if you are satisfied with the current results or believe nothing more can be extracted, simply tune the classification probability threshold so that sensitivity and specificity match your requirements for the classes (you may care more about misclassifying "stayed" as "left", or vice versa, depending on your problem).
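A minimal sketch of that threshold adjustment, again with the hypothetical `rf_fit` / `test_data` names and an arbitrary cutoff of 0.3 for illustration:

```r
# Probability of the 'left' class from the fitted model
probs <- predict(rf_fit, newdata = test_data, type = "prob")[, "left"]

# A cutoff below 0.5 labels more cases as 'left', trading specificity
# on 'stayed' for sensitivity on 'left'; 0.3 is purely illustrative
threshold <- 0.3
pred_class <- factor(ifelse(probs >= threshold, "left", "stayed"),
                     levels = levels(test_data$left_welfare))

confusionMatrix(pred_class, test_data$left_welfare, positive = "left")
```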

Hope this helps!
