Hi everyone, I am trying to search for the best parameters using a for loop. However, the results are confusing me. The two pieces of code below should give the same result, because the parameter "mtry" is the same.
       gender Partner   tenure Churn
3521     Male      No 0.992313   Yes
2525.1   Male      No 4.276666    No
567      Male     Yes 2.708050    No
8381   Female      No 4.202127   Yes
6258   Female      No 0.000000   Yes
6569     Male     Yes 2.079442    No
27410  Female      No 1.550804   Yes
6429   Female      No 1.791759   Yes
412    Female     Yes 3.828641    No
4655   Female     Yes 3.737670    No
RFModel = randomForest(Churn ~ .,
                       data = ggg,
                       ntree = 30,
                       mtry = 2,
                       importance = TRUE,
                       replace = FALSE)
print(RFModel$confusion)

    No Yes class.error
No   4   1         0.2
Yes  1   4         0.2
for (i in c(2)) {
  RFModel = randomForest(Churn ~ .,
                         data = Trainingds,
                         ntree = 30,
                         mtry = i,
                         importance = TRUE,
                         replace = FALSE)
  print(RFModel$confusion)
}

    No Yes class.error
No   3   2         0.4
Yes  2   3         0.4
- Code 1 and code 2 should produce the same output.
Answer:
You get slightly different results on each run because randomness is built into the algorithm. To build each tree, the algorithm resamples the data frame and randomly selects mtry columns from the resampled data frame. If you want models built with the same parameters (e.g., mtry, ntree) to give the same result every time, you need to set a random seed.
For example, let's run randomForest 10 times and check the mean of the mean squared error for each run. Note that the mean MSE is different each time:
library(randomForest)
replicate(10, mean(randomForest(mpg ~ ., data=mtcars)$mse))
[1] 5.998530 6.307782 5.791657 6.125588 5.868717 5.845616 5.427208 6.112762 5.777624 6.150021
If you run the code above yourself, you will get another 10 values that differ from the ones shown.
If you want to be able to reproduce the result of a model run with the same parameters (e.g., mtry and ntree), you can set a random seed. For example:
set.seed(5)
mean(randomForest(mpg ~ ., data=mtcars)$mse)
[1] 6.017737
If you use the same seed value you will get the same result; otherwise you will not. Using a larger ntree value will reduce, but not eliminate, the variability between model runs.
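One way to see this is to compare the spread of the mean MSE across repeated runs for a small and a large ntree (a sketch using the built-in mtcars data; the ntree values and number of repetitions here are arbitrary, and the exact numbers will vary from run to run):

```r
library(randomForest)

# Spread of the mean MSE across 10 runs with few trees ...
sd_small <- sd(replicate(10, mean(randomForest(mpg ~ ., data = mtcars, ntree = 30)$mse)))

# ... versus many trees
sd_large <- sd(replicate(10, mean(randomForest(mpg ~ ., data = mtcars, ntree = 2000)$mse)))

sd_small
sd_large
```

The standard deviation for the larger ntree is typically much smaller, but it does not go to zero: only set.seed makes the runs exactly reproducible.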
Update: When I run your code with the data sample you provided, I do not always get the same result each time. Even with replace=FALSE, which causes the data frame to be sampled without replacement, the columns used to build each tree can still differ from run to run:
> randomForest(Churn ~ .,
+              data = ggg,
+              ntree = 30,
+              mtry = 2,
+              importance = TRUE,
+              replace = FALSE)

Call:
 randomForest(formula = Churn ~ ., data = ggg, ntree = 30, mtry = 2, importance = TRUE, replace = FALSE)
               Type of random forest: classification
                     Number of trees: 30
No. of variables tried at each split: 2

        OOB estimate of  error rate: 30%
Confusion matrix:
    No Yes class.error
No   3   2         0.4
Yes  1   4         0.2

> randomForest(Churn ~ .,
+              data = ggg,
+              ntree = 30,
+              mtry = 2,
+              importance = TRUE,
+              replace = FALSE)

Call:
 randomForest(formula = Churn ~ ., data = ggg, ntree = 30, mtry = 2, importance = TRUE, replace = FALSE)
               Type of random forest: classification
                     Number of trees: 30
No. of variables tried at each split: 2

        OOB estimate of  error rate: 20%
Confusion matrix:
    No Yes class.error
No   4   1         0.2
Yes  1   4         0.2

> randomForest(Churn ~ .,
+              data = ggg,
+              ntree = 30,
+              mtry = 2,
+              importance = TRUE,
+              replace = FALSE)

Call:
 randomForest(formula = Churn ~ ., data = ggg, ntree = 30, mtry = 2, importance = TRUE, replace = FALSE)
               Type of random forest: classification
                     Number of trees: 30
No. of variables tried at each split: 2

        OOB estimate of  error rate: 30%
Confusion matrix:
    No Yes class.error
No   3   2         0.4
Yes  1   4         0.2
Here are similar results obtained with the built-in iris data frame:
> randomForest(Species ~ ., data=iris, ntree=30, mtry=2, importance = TRUE,
+              replace = FALSE)

Call:
 randomForest(formula = Species ~ ., data = iris, ntree = 30, mtry = 2, importance = TRUE, replace = FALSE)
               Type of random forest: classification
                     Number of trees: 30
No. of variables tried at each split: 2

        OOB estimate of  error rate: 3.33%
Confusion matrix:
           setosa versicolor virginica class.error
setosa         50          0         0        0.00
versicolor      0         47         3        0.06
virginica       0          2        48        0.04

> randomForest(Species ~ ., data=iris, ntree=30, mtry=2, importance = TRUE,
+              replace = FALSE)

Call:
 randomForest(formula = Species ~ ., data = iris, ntree = 30, mtry = 2, importance = TRUE, replace = FALSE)
               Type of random forest: classification
                     Number of trees: 30
No. of variables tried at each split: 2

        OOB estimate of  error rate: 4.67%
Confusion matrix:
           setosa versicolor virginica class.error
setosa         50          0         0        0.00
versicolor      0         47         3        0.06
virginica       0          4        46        0.08

> randomForest(Species ~ ., data=iris, ntree=30, mtry=2, importance = TRUE,
+              replace = FALSE)

Call:
 randomForest(formula = Species ~ ., data = iris, ntree = 30, mtry = 2, importance = TRUE, replace = FALSE)
               Type of random forest: classification
                     Number of trees: 30
No. of variables tried at each split: 2

        OOB estimate of  error rate: 6%
Confusion matrix:
           setosa versicolor virginica class.error
setosa         50          0         0        0.00
versicolor      0         47         3        0.06
virginica       0          6        44        0.12
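To make the original loop reproducible, set the seed immediately before each fit. A sketch using the built-in iris data in place of the questioner's Trainingds/ggg data frames (the seed value 42 is arbitrary):

```r
library(randomForest)

for (i in c(2)) {
  set.seed(42)  # same seed before each fit => identical forest every run
  model <- randomForest(Species ~ ., data = iris, ntree = 30, mtry = i,
                        importance = TRUE, replace = FALSE)
  print(model$confusion)
}
```

Re-running this loop now prints the same confusion matrix every time, and it will also match a standalone call made with the same seed and parameters.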