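I am trying to build a random forest model in caret by following the steps here. Essentially, they set up the random forest, then find the best mtry, then the best maxnodes, and finally the best number of trees. These steps make sense, but wouldn't it be better to search over the interaction of these three factors at once?
Secondly, I understand running a grid search over mtry and ntree, but I do not know what to set the minimum or maximum number of nodes to. Is it generally recommended to leave nodesize at its default, as shown below?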
    library(randomForest)
    library(caret)

    mtrys <- seq(1, 4, 1)
    ntrees <- c(250, 300, 350, 400, 450, 500, 550, 600, 800, 1000, 2000)
    combo_mtrTrees <- data.frame(expand.grid(mtrys, ntrees))
    colnames(combo_mtrTrees) <- c('mtrys', 'ntrees')
    tuneGrid <- expand.grid(.mtry = c(1:4))

    for (i in 1:length(ntrees)) {
      ntree <- ntrees[i]
      set.seed(65)
      rf_maxtrees <- train(Species ~ .,
                           data = df,
                           method = "rf",
                           importance = TRUE,
                           metric = "Accuracy",
                           tuneGrid = tuneGrid,
                           trControl = trainControl(method = "cv",
                                                    number = 5,
                                                    search = 'grid',
                                                    classProbs = TRUE,
                                                    savePredictions = "final"),
                           ntree = ntree)
      Acc1 <- rf_maxtrees$results$Accuracy[rf_maxtrees$results$mtry == 1]
      Acc2 <- rf_maxtrees$results$Accuracy[rf_maxtrees$results$mtry == 2]
      Acc3 <- rf_maxtrees$results$Accuracy[rf_maxtrees$results$mtry == 3]
      Acc4 <- rf_maxtrees$results$Accuracy[rf_maxtrees$results$mtry == 4]
      combo_mtrTrees$Acc[combo_mtrTrees$mtrys == 1 & combo_mtrTrees$ntrees == ntree] <- Acc1
      combo_mtrTrees$Acc[combo_mtrTrees$mtrys == 2 & combo_mtrTrees$ntrees == ntree] <- Acc2
      combo_mtrTrees$Acc[combo_mtrTrees$mtrys == 3 & combo_mtrTrees$ntrees == ntree] <- Acc3
      combo_mtrTrees$Acc[combo_mtrTrees$mtrys == 4 & combo_mtrTrees$ntrees == ntree] <- Acc4
    }
Answer:
- Yes, it is better to search over the interaction of the parameters.
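- nodesize and maxnodes are usually left at their defaults, but there is no reason not to tune them. Personally, I would leave maxnodes at its default and perhaps tune nodesize, which can be viewed as a regularization parameter. To get an idea of which values to try, look at the defaults in rf: 1 for classification and 5 for regression. So trying 1 to 10 is one option.
- When tuning in a loop as in your example, it is advisable to always use the same cross-validation folds. You can create them with createFolds before calling the loop.
- After tuning, make sure to evaluate your results on an independent validation set, or perform nested cross-validation, where the inner loop tunes the parameters and the outer loop estimates model performance. The results obtained from the tuning cross-validation alone will be optimistically biased.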
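- In most cases accuracy is not an appropriate metric for choosing the best classification model, especially when the data set is imbalanced. Read up on the area under the receiver operating characteristic curve (ROC AUC), Cohen's kappa, the Matthews correlation coefficient, balanced accuracy, the F1 score, and classification threshold tuning.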
- Here is an example of how to tune the rf parameters jointly. I will use the Sonar data set from the mlbench package.
Create predefined folds:

    library(caret)
    library(mlbench)
    data(Sonar)

    set.seed(1234)
    # returnTrain = TRUE returns the training-row indices for each fold
    cv_folds <- createFolds(Sonar$Class, k = 5, returnTrain = TRUE)
Create the tuning control:

    tuneGrid <- expand.grid(.mtry = c(1:10))

    ctrl <- trainControl(method = "cv",
                         number = 5,
                         search = 'grid',
                         classProbs = TRUE,
                         savePredictions = "final",
                         index = cv_folds,
                         summaryFunction = twoClassSummary) # in most cases a better summary function for two-class problems
Define the other parameters to tune. I will use only a few combinations to limit the training time of the example:

    ntrees <- c(500, 1000)
    nodesize <- c(1, 5)

    params <- expand.grid(ntrees = ntrees,
                          nodesize = nodesize)
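To make the cost of a joint search concrete, the following sketch counts the fits implied by the values used in this example (10 values of mtry, 2 of ntree, 2 of nodesize, 5 folds). The variable names are hypothetical and chosen so they do not overwrite the objects defined above; it uses base R only.

```r
# Hypothetical illustration: count the model fits implied by the joint grid.
mtry_values     <- 1:10          # same range as tuneGrid in the example
ntree_values    <- c(500, 1000)  # same values as ntrees in the example
nodesize_values <- c(1, 5)       # same values as nodesize in the example
n_folds         <- 5             # 5-fold cross-validation

# Every combination of the three parameters
full_grid <- expand.grid(mtry = mtry_values,
                         ntree = ntree_values,
                         nodesize = nodesize_values)

n_combinations <- nrow(full_grid)           # 10 * 2 * 2
n_fits <- n_combinations * n_folds          # one fit per combination per fold

cat("combinations:", n_combinations, " cross-validation fits:", n_fits, "\n")
```

Each train() call also refits one final model on the full training data, so the count grows quickly as more values are added to any of the three parameters.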
Train:

    store_maxnode <- vector("list", nrow(params))

    for (i in 1:nrow(params)) {
      nodesize <- params[i, 2]
      ntree <- params[i, 1]
      set.seed(65)
      rf_model <- train(Class ~ .,
                        data = Sonar,
                        method = "rf",
                        importance = TRUE,
                        metric = "ROC",
                        tuneGrid = tuneGrid,
                        trControl = ctrl,
                        ntree = ntree,
                        nodesize = nodesize)
      store_maxnode[[i]] <- rf_model
    }
################### 26.02.2021.
To avoid generic model names (model1, model2, ...), we can name the result list with the corresponding parameters:

    names(store_maxnode) <- paste("ntrees:", params$ntrees,
                                  "nodesize:", params$nodesize)
################### 26.02.2021.
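Since each element of store_maxnode was itself tuned over mtry, it can help to see which mtry each model selected. The helper below is a minimal sketch, not part of caret: summarise_best is a hypothetical name, and it assumes each list element has the bestTune and results components of a caret train object, with a ROC column in results (which is the case when twoClassSummary is the summary function).

```r
# Hypothetical helper: one row per fitted model with its selected mtry
# and the resampled ROC at that mtry, sorted from best to worst.
summarise_best <- function(models) {
  rows <- lapply(seq_along(models), function(i) {
    m <- models[[i]]
    best_mtry <- m$bestTune$mtry
    # Fall back to a generic label if the list is unnamed
    label <- names(models)[i]
    if (is.null(label) || !nzchar(label)) label <- paste0("model", i)
    data.frame(model = label,
               mtry = best_mtry,
               ROC = m$results$ROC[m$results$mtry == best_mtry],
               stringsAsFactors = FALSE)
  })
  out <- do.call(rbind, rows)
  out[order(-out$ROC), ]
}

# Usage once the training loop has run:
# summarise_best(store_maxnode)
```

The top row then gives a candidate combination of ntree, nodesize and mtry, subject to the caveat about optimistic cross-validation estimates.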
Combine the results:

    results_mtry <- resamples(store_maxnode)
    summary(results_mtry)
Output:

    Call:
    summary.resamples(object = results_mtry)

    Models: ntrees: 500 nodesize: 1, ntrees: 1000 nodesize: 1, ntrees: 500 nodesize: 5, ntrees: 1000 nodesize: 5
    Number of resamples: 5

    ROC
                                  Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
    ntrees: 500 nodesize: 1   0.9108696 0.9354067 0.9449761 0.9465758 0.9688995 0.9727273    0
    ntrees: 1000 nodesize: 1  0.8847826 0.9473684 0.9569378 0.9474828 0.9665072 0.9818182    0
    ntrees: 500 nodesize: 5   0.9163043 0.9377990 0.9569378 0.9481652 0.9593301 0.9704545    0
    ntrees: 1000 nodesize: 5  0.9000000 0.9342105 0.9521531 0.9462321 0.9641148 0.9806818    0

    Sens
                                  Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
    ntrees: 500 nodesize: 1   0.9090909 0.9545455 0.9545455 0.9549407 0.9565217 1.0000000    0
    ntrees: 1000 nodesize: 1  0.9090909 0.9130435 0.9545455 0.9371542 0.9545455 0.9545455    0
    ntrees: 500 nodesize: 5   0.9090909 0.9545455 0.9545455 0.9458498 0.9545455 0.9565217    0
    ntrees: 1000 nodesize: 5  0.9090909 0.9545455 0.9545455 0.9458498 0.9545455 0.9565217    0

    Spec
                              Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
    ntrees: 500 nodesize: 1   0.65 0.6842105 0.7368421 0.7421053 0.7894737 0.8500000    0
    ntrees: 1000 nodesize: 1  0.60 0.6842105 0.7894737 0.7631579 0.8421053 0.9000000    0
    ntrees: 500 nodesize: 5   0.55 0.6842105 0.7894737 0.7331579 0.8000000 0.8421053    0
    ntrees: 1000 nodesize: 5  0.60 0.6842105 0.7368421 0.7321053 0.7894737 0.8500000    0
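The nested cross-validation mentioned earlier could look roughly like the sketch below. It is a simplified illustration under several assumptions: only mtry is tuned in the inner loop (with a reduced grid, and ntree and nodesize fixed at illustrative values) to keep the run time short, the names outer_folds, auc_rank and outer_auc are hypothetical, and the outer AUC is computed with the rank-based (Mann-Whitney) formula in base R rather than with a dedicated package. The first factor level of Sonar$Class, "M", is treated as the positive class, matching what twoClassSummary does.

```r
library(caret)
library(randomForest)
library(mlbench)
data(Sonar)

# Rank-based AUC: probability that a random positive scores above a random negative
auc_rank <- function(score, is_positive) {
  n_pos <- sum(is_positive)
  n_neg <- sum(!is_positive)
  (sum(rank(score)[is_positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

set.seed(1234)
# Outer folds: used only to estimate performance, never for tuning
outer_folds <- createFolds(Sonar$Class, k = 5, returnTrain = TRUE)

outer_auc <- sapply(outer_folds, function(train_idx) {
  train_set <- Sonar[train_idx, ]
  test_set  <- Sonar[-train_idx, ]

  # Inner loop: 5-fold CV on the outer training part tunes mtry
  inner_ctrl <- trainControl(method = "cv",
                             number = 5,
                             classProbs = TRUE,
                             summaryFunction = twoClassSummary)
  fit <- train(Class ~ .,
               data = train_set,
               method = "rf",
               metric = "ROC",
               tuneGrid = expand.grid(mtry = c(2, 5, 10)),  # reduced grid (assumption)
               trControl = inner_ctrl,
               ntree = 250,      # illustrative value
               nodesize = 1)     # illustrative value

  # Evaluate the tuned model on the held-out outer fold
  probs <- predict(fit, newdata = test_set, type = "prob")
  auc_rank(probs[, "M"], test_set$Class == "M")
})

outer_auc
mean(outer_auc)
```

The mean of outer_auc is the performance estimate to report. In the summary above, the mean ROC of the four settings only ranges from about 0.946 to 0.948, so differences of that size between configurations should not be over-interpreted.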