Building a random forest with caret

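I am trying to build a random forest model in caret by following the steps here. Essentially, they set up the random forest, then find the best mtry, then the best maxnodes, and finally the best number of trees. The steps make sense, but wouldn't it be better to search over the interaction of these three factors at once rather than one at a time?

Secondly, I understand running a grid search over mtry and ntrees. But I don't know what to set the minimum or maximum number of nodes to. Is it generally advised to leave nodesize at its default, as shown below?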

    library(randomForest)
    library(caret)

    mtrys <- seq(1, 4, 1)
    ntrees <- c(250, 300, 350, 400, 450, 500, 550, 600, 800, 1000, 2000)
    combo_mtrTrees <- data.frame(expand.grid(mtrys, ntrees))
    colnames(combo_mtrTrees) <- c('mtrys', 'ntrees')

    tuneGrid <- expand.grid(.mtry = c(1:4))

    # df is the data frame used in the question (not shown); it has a Species column
    for (i in 1:length(ntrees)) {
      ntree <- ntrees[i]
      set.seed(65)
      rf_maxtrees <- train(Species ~ .,
                           data = df,
                           method = "rf",
                           importance = TRUE,
                           metric = "Accuracy",
                           tuneGrid = tuneGrid,
                           trControl = trainControl(method = "cv",
                                                    number = 5,
                                                    search = 'grid',
                                                    classProbs = TRUE,
                                                    savePredictions = "final"),
                           ntree = ntree)
      Acc1 <- rf_maxtrees$results$Accuracy[rf_maxtrees$results$mtry == 1]
      Acc2 <- rf_maxtrees$results$Accuracy[rf_maxtrees$results$mtry == 2]
      Acc3 <- rf_maxtrees$results$Accuracy[rf_maxtrees$results$mtry == 3]
      Acc4 <- rf_maxtrees$results$Accuracy[rf_maxtrees$results$mtry == 4]
      combo_mtrTrees$Acc[combo_mtrTrees$mtrys == 1 & combo_mtrTrees$ntrees == ntree] <- Acc1
      combo_mtrTrees$Acc[combo_mtrTrees$mtrys == 2 & combo_mtrTrees$ntrees == ntree] <- Acc2
      combo_mtrTrees$Acc[combo_mtrTrees$mtrys == 3 & combo_mtrTrees$ntrees == ntree] <- Acc3
      combo_mtrTrees$Acc[combo_mtrTrees$mtrys == 4 & combo_mtrTrees$ntrees == ntree] <- Acc4
    }

Answer:

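Yes, it is better to search over the interaction of the parameters.
nodesize and maxnodes are usually left at their defaults, but there is no reason not to tune them. Personally I would leave maxnodes at its default and possibly tune nodesize, which can be seen as a regularization parameter. To get an idea of which values to try, check the defaults in rf, which are 1 for classification and 5 for regression. So trying 1 to 10 is an option.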
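To see why nodesize acts like a regularizer, it may help to compare tree sizes directly. The following is a minimal sketch, assuming the randomForest and mlbench packages are installed; the object names (rf_small_nodes, rf_large_nodes, mean_leaves) and the values 1 and 10 are chosen here only for illustration.

```r
library(randomForest)
library(mlbench)
data(Sonar)

# Same seed and number of trees, only nodesize differs
set.seed(42)
rf_small_nodes <- randomForest(Class ~ ., data = Sonar, ntree = 200, nodesize = 1)
set.seed(42)
rf_large_nodes <- randomForest(Class ~ ., data = Sonar, ntree = 200, nodesize = 10)

# treesize() returns the number of terminal nodes of every tree in the forest
mean_leaves <- c(nodesize_1  = mean(treesize(rf_small_nodes)),
                 nodesize_10 = mean(treesize(rf_large_nodes)))
print(mean_leaves)
```

A larger nodesize stops splitting earlier, so the trees have fewer terminal nodes and each tree fits the training data less closely.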
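When tuning in a loop, as in your example, it is advisable to always use the same cross-validation folds. You can create them with createFolds before running the loop.
After tuning, make sure to evaluate your results on an independent validation set, or perform nested cross-validation, where the inner loop tunes the parameters and the outer loop estimates model performance. Results obtained from the tuning cross-validation alone will be optimistically biased.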
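Nested cross-validation can be sketched roughly as follows. This is an illustrative outline, not a tuned setup: the helper auc_rank, the names outer_folds and inner_ctrl, the three mtry values, the 3-fold inner loop and ntree = 200 are all assumptions made here to keep the run short.

```r
library(caret)
library(mlbench)
data(Sonar)

# AUC from the rank-sum (Mann-Whitney) formula; is_pos marks the positive class
auc_rank <- function(score, is_pos) {
  r <- rank(score)
  n_pos <- sum(is_pos)
  n_neg <- sum(!is_pos)
  (sum(r[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

set.seed(1234)
outer_folds <- createFolds(Sonar$Class, k = 5, returnTrain = TRUE)

# Inner loop: caret tunes mtry with its own 3-fold CV on the outer training part
inner_ctrl <- trainControl(method = "cv", number = 3,
                           classProbs = TRUE,
                           summaryFunction = twoClassSummary)

# Outer loop: each held-out part is used only to score the tuned model
outer_auc <- sapply(outer_folds, function(train_idx) {
  fit <- train(Class ~ ., data = Sonar[train_idx, ],
               method = "rf", metric = "ROC",
               tuneGrid = expand.grid(mtry = c(2, 8, 20)),
               trControl = inner_ctrl, ntree = 200)
  held_out <- Sonar[-train_idx, ]
  prob_M <- predict(fit, newdata = held_out, type = "prob")[, "M"]
  auc_rank(prob_M, held_out$Class == "M")
})

mean(outer_auc)  # performance estimate that never saw the tuning data
```

The outer folds never take part in choosing mtry, so the mean of outer_auc is a less optimistic estimate than the best ROC reported by the tuning run itself.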
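In most cases accuracy is not an appropriate metric for picking the best classification model, especially when the data set is imbalanced. Read up on the area under the receiver operating characteristic curve (ROC AUC), Cohen's kappa, the Matthews correlation coefficient, balanced accuracy, the F1 score, and classification threshold tuning.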
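A small worked example may make the point about accuracy concrete. The confusion-matrix counts below are made up for illustration (10 positives, 90 negatives, a model that finds only 2 of the positives); the formulas are the standard definitions, written in base R.

```r
# Hypothetical counts, invented for this illustration
tp <- 2; fn <- 8; fp <- 1; tn <- 89
n <- tp + fn + fp + tn

accuracy          <- (tp + tn) / n
sensitivity       <- tp / (tp + fn)
specificity       <- tn / (tn + fp)
balanced_accuracy <- (sensitivity + specificity) / 2
f1                <- 2 * tp / (2 * tp + fp + fn)

# Cohen's kappa: observed agreement corrected for chance agreement
p_expected  <- ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n^2
cohen_kappa <- (accuracy - p_expected) / (1 - p_expected)

# Matthews correlation coefficient
mcc <- (tp * tn - fp * fn) /
  sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

round(c(accuracy = accuracy, balanced_accuracy = balanced_accuracy,
        f1 = f1, kappa = cohen_kappa, mcc = mcc), 3)
```

With these counts accuracy is 0.91 even though 8 of the 10 positives are missed, while balanced accuracy (about 0.59), F1 (about 0.31), kappa (about 0.27) and MCC (about 0.33) all reflect the weak performance on the minority class.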
Here is an example of how to tune the rf parameters jointly. I will use the Sonar data set from the mlbench package.

Create predefined folds:

    library(caret)
    library(mlbench)
    data(Sonar)

    set.seed(1234)
    cv_folds <- createFolds(Sonar$Class, k = 5, returnTrain = TRUE)

Create the tune control:

    tuneGrid <- expand.grid(.mtry = c(1:10))

    ctrl <- trainControl(method = "cv",
                         number = 5,
                         search = 'grid',
                         classProbs = TRUE,
                         savePredictions = "final",
                         index = cv_folds,
                         summaryFunction = twoClassSummary) # in most cases a better summary function for two-class problems

Define the other parameters to tune. I will use only a few combinations to limit the training time of the example:

    ntrees <- c(500, 1000)
    nodesize <- c(1, 5)

    params <- expand.grid(ntrees = ntrees,
                          nodesize = nodesize)

Train:

    # one list slot per (ntrees, nodesize) combination
    store_maxnode <- vector("list", nrow(params))

    for (i in 1:nrow(params)) {
      nodesize <- params[i, 2]
      ntree <- params[i, 1]
      set.seed(65)
      rf_model <- train(Class ~ .,
                        data = Sonar,
                        method = "rf",
                        importance = TRUE,
                        metric = "ROC",
                        tuneGrid = tuneGrid,
                        trControl = ctrl,
                        ntree = ntree,
                        nodesize = nodesize)
      store_maxnode[[i]] <- rf_model
    }

Edit (26.02.2021): to avoid generic model names (model1, model2, ...), we can name the result list after the corresponding parameters:

    names(store_maxnode) <- paste("ntrees:", params$ntrees,
                                  "nodesize:", params$nodesize)

Combine the results:

    results_mtry <- resamples(store_maxnode)
    summary(results_mtry)

Output:

    Call:
    summary.resamples(object = results_mtry)

    Models: ntrees: 500 nodesize: 1, ntrees: 1000 nodesize: 1, ntrees: 500 nodesize: 5, ntrees: 1000 nodesize: 5
    Number of resamples: 5

    ROC
                                  Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
    ntrees: 500 nodesize: 1  0.9108696 0.9354067 0.9449761 0.9465758 0.9688995 0.9727273    0
    ntrees: 1000 nodesize: 1 0.8847826 0.9473684 0.9569378 0.9474828 0.9665072 0.9818182    0
    ntrees: 500 nodesize: 5  0.9163043 0.9377990 0.9569378 0.9481652 0.9593301 0.9704545    0
    ntrees: 1000 nodesize: 5 0.9000000 0.9342105 0.9521531 0.9462321 0.9641148 0.9806818    0

    Sens
                                  Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
    ntrees: 500 nodesize: 1  0.9090909 0.9545455 0.9545455 0.9549407 0.9565217 1.0000000    0
    ntrees: 1000 nodesize: 1 0.9090909 0.9130435 0.9545455 0.9371542 0.9545455 0.9545455    0
    ntrees: 500 nodesize: 5  0.9090909 0.9545455 0.9545455 0.9458498 0.9545455 0.9565217    0
    ntrees: 1000 nodesize: 5 0.9090909 0.9545455 0.9545455 0.9458498 0.9545455 0.9565217    0

    Spec
                             Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
    ntrees: 500 nodesize: 1  0.65 0.6842105 0.7368421 0.7421053 0.7894737 0.8500000    0
    ntrees: 1000 nodesize: 1 0.60 0.6842105 0.7894737 0.7631579 0.8421053 0.9000000    0
    ntrees: 500 nodesize: 5  0.55 0.6842105 0.7894737 0.7331579 0.8000000 0.8421053    0
    ntrees: 1000 nodesize: 5 0.60 0.6842105 0.7368421 0.7321053 0.7894737 0.8500000    0
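To compare every (ntree, nodesize, mtry) combination in one place, the per-model results can also be stacked into a single table and sorted. The sketch below is a reduced variant of the loop above under stated assumptions: a smaller grid (ntree 100 and 200, mtry 2 and 8) picked only to keep it fast, and the names folds, grid and all_results introduced here for illustration.

```r
library(caret)
library(mlbench)
data(Sonar)

set.seed(1234)
folds <- createFolds(Sonar$Class, k = 5, returnTrain = TRUE)

ctrl <- trainControl(method = "cv", number = 5, index = folds,
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary)

params <- expand.grid(ntree = c(100, 200), nodesize = c(1, 5))
grid   <- expand.grid(mtry = c(2, 8))

# One row per (ntree, nodesize, mtry) combination with its CV metrics
all_results <- do.call(rbind, lapply(seq_len(nrow(params)), function(i) {
  set.seed(65)
  fit <- train(Class ~ ., data = Sonar, method = "rf", metric = "ROC",
               tuneGrid = grid, trControl = ctrl,
               ntree = params$ntree[i], nodesize = params$nodesize[i])
  data.frame(ntree = params$ntree[i],
             nodesize = params$nodesize[i],
             fit$results[, c("mtry", "ROC", "Sens", "Spec")])
}))

# Best combinations first
all_results <- all_results[order(-all_results$ROC), ]
head(all_results, 3)
```

Because all fits share the same folds, differences between rows reflect the parameters and the random forest's own randomness rather than different data splits; small gaps in ROC should still be read with caution.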
