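I am running a knn regression on my data and would like to:
a) cross-validate with repeatedcv to find the optimal value of k;
b) when building the knn model, use PCA with a 90% variance threshold to reduce the dimensionality.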
    library(caret)
    library(dplyr)

    set.seed(0)
    data = cbind(rnorm(20, 100, 10), matrix(rnorm(400, 10, 5), ncol = 20)) %>%
      data.frame()
    colnames(data) = c('True', paste0('Day', 1:20))

    tr = data[1:15, ]   # training set
    tt = data[16:20, ]  # test set

    train.control = trainControl(method = "repeatedcv", number = 5, repeats = 3)
    k = train(True ~ .,
              method = "knn",
              tuneGrid = expand.grid(k = 1:10),  # try k from 1 to 10 to find the best value
              trControl = train.control,
              preProcess = c('scale', 'pca'),
              metric = "RMSE",
              data = tr)
My questions are:
(1) I noticed that someone suggested changing the pca parameter in trainControl:
    ctrl <- trainControl(preProcOptions = list(thresh = 0.8))
    mod <- train(Class ~ ., data = Sonar, method = "pls", trControl = ctrl)
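If I change the parameter in trainControl, does that mean PCA is still carried out as part of the KNN fit? (This is similar to this question.)
(2) I found another example that fits my case. I would like to change the threshold to 90%, but I don't know where to set it in caret's train function, especially since I also need the scale option.
I apologize for the lengthy description and the scattered references. Thank you in advance!
(Thanks to @Camille for the suggestions that made the code run!)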
Answer:
To answer your questions:
I noticed that someone suggested changing the pca parameter in trainControl:
    mod <- train(Class ~ ., data = Sonar, method = "pls", trControl = ctrl)
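If I change the parameter in trainControl, does that mean PCA is still carried out as part of the KNN fit?
Yes, provided preProcess in train still includes 'pca'. If you do this: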
    train.control = trainControl(method = "repeatedcv", number = 5, repeats = 3,
                                 preProcOptions = list(thresh = 0.9))
    k = train(True ~ .,
              method = "knn",
              tuneGrid = expand.grid(k = 1:10),
              trControl = train.control,
              preProcess = c('scale', 'pca'),
              metric = "RMSE",
              data = tr)
You can check the result under preProcess:
    k$preProcess
    Created from 15 samples and 20 variables

    Pre-processing:
      - centered (20)
      - ignored (0)
      - principal component signal extraction (20)
      - scaled (20)

    PCA needed 9 components to capture 90 percent of the variance
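To make the "components needed to capture 90 percent of the variance" line more concrete, here is a minimal base R sketch of how such a threshold can be turned into a number of components. It uses prcomp rather than caret, the matrix x is a hypothetical stand-in for the training predictors, and caret's own rule may differ in details (for example in how ties or a minimum number of components are handled), so the count printed here need not match the output above.

```r
set.seed(0)
# Hypothetical stand-in for 15 training rows with 20 predictors.
x <- matrix(rnorm(300, 10, 5), ncol = 20)

# Centre and scale the predictors, then run PCA.
pca <- prcomp(x, center = TRUE, scale. = TRUE)

# Share of total variance per component, and its running total.
var_share <- pca$sdev^2 / sum(pca$sdev^2)
cum_share <- cumsum(var_share)

# Smallest number of leading components whose cumulative share reaches the threshold.
thresh <- 0.9
n_comp <- which(cum_share >= thresh)[1]
n_comp
```

Raising thresh keeps more components and lowering it keeps fewer, which is the knob that preProcOptions = list(thresh = ...) exposes.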
For question 2, you can also run preProcess on its own and train on the transformed data:
    mdl = preProcess(tr[,-1], method = c("scale", "pca"), thresh = 0.9)
    mdl
    Created from 15 samples and 20 variables

    Pre-processing:
      - centered (20)
      - ignored (0)
      - principal component signal extraction (20)
      - scaled (20)

    PCA needed 9 components to capture 90 percent of the variance

    train.control = trainControl(method = "repeatedcv", number = 5, repeats = 3)
    k = train(True ~ .,
              method = "knn",
              tuneGrid = expand.grid(k = 1:10),
              trControl = train.control,
              metric = "RMSE",
              data = predict(mdl, tr))
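With the standalone preProcess route, the same transformation has to be applied to the test rows by hand before predicting. The sketch below shows one way to do that on the question's synthetic data. The names tr_pca, tt_pca, fit and pred are introduced here for illustration only. One caveat worth keeping in mind: because the PCA is fitted once on all training rows before resampling, every CV fold shares the same rotation, so the cross-validated RMSE may be slightly optimistic compared with passing preProcess to train.

```r
library(caret)

set.seed(0)
# Same synthetic data as in the question: one response plus 20 predictors.
data <- data.frame(cbind(rnorm(20, 100, 10),
                         matrix(rnorm(400, 10, 5), ncol = 20)))
colnames(data) <- c("True", paste0("Day", 1:20))
tr <- data[1:15, ]   # training set
tt <- data[16:20, ]  # test set

# Fit scaling + PCA on the training predictors only.
mdl <- preProcess(tr[, -1], method = c("scale", "pca"), thresh = 0.9)

# Apply the fitted transformation to both sets; the True column is carried along.
tr_pca <- predict(mdl, tr)
tt_pca <- predict(mdl, tt)

train.control <- trainControl(method = "repeatedcv", number = 5, repeats = 3)
fit <- train(True ~ ., data = tr_pca,
             method = "knn",
             tuneGrid = expand.grid(k = 1:10),
             trControl = train.control,
             metric = "RMSE")

fit$bestTune                             # k selected by repeated CV
pred <- predict(fit, newdata = tt_pca)   # predictions for the 5 test rows
pred
```

When preProcess is given to train instead, predict on the fitted train object applies the stored pre-processing to new data itself, so this manual step is not needed there.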