I am trying to use knn for prediction, but first I want to run principal component analysis to reduce the dimensionality. However, after I generated the principal components and applied them to knn, I got the error:
“Error in [.data.frame(data, , all.vars(Terms), drop = FALSE) :
undefined columns selected”
along with the warning:
“In addition: Warning message: In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, : There were missing values in resampled performance measures.”
Here is my sample:
sample = cbind(rnorm(20, 100, 10), matrix(rnorm(100, 10, 2), nrow = 20)) %>% data.frame()
The first 15 rows are used for the training set:
train1 = sample[1:15, ]
test = sample[16:20, ]
Remove the dependent variable:
pca.tr = sample[1:15, 2:6]
pcom = prcomp(pca.tr, scale. = T)
pca.tr = data.frame(True = train1[, 1], pcom$x)
# select the first two principal components
pca.tr = pca.tr[, 1:2]
train.ct = trainControl(method = "repeatedcv", number = 3, repeats = 1)
k = train(train1[,1] ~ .,
          method = "knn",
          tuneGrid = expand.grid(k = 1:5),
          trControl = train.control,
          preProcess = 'scale',
          metric = "RMSE",
          data = cbind(train1[,1], pca.tr))
Any advice would be greatly appreciated!
Answer:
Use better column names and a formula without subscripting. You should really try to post a reproducible example; parts of your code are wrong. Also, preProc has a "pca" method that recomputes the PCA scores inside of resampling, which does the appropriate thing.
library(caret)
#> Loading required package: lattice
#> Loading required package: ggplot2
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
set.seed(55)
sample = cbind(rnorm(20, 100, 10), matrix(rnorm(100, 10, 2), nrow = 20)) %>% data.frame()
train1 = sample[1:15, ]
test = sample[16:20, ]
pca.tr = sample[1:15, 2:6]
pcom = prcomp(pca.tr, scale. = T)
pca.tr = data.frame(True = train1[, 1], pcom$x)
# select the first two principal components
pca.tr = pca.tr[, 1:2]

dat <- cbind(train1[, 1], pca.tr) %>% # This
  setNames(c("y", "True", "PC1"))

train.ct = trainControl(method = "repeatedcv", number = 3, repeats = 1)

set.seed(356)
k = train(y ~ .,
          method = "knn",
          tuneGrid = expand.grid(k = 1:5),
          trControl = train.ct, # this argument was wrong in your code
          preProcess = 'scale',
          metric = "RMSE",
          data = dat)
k
#> k-Nearest Neighbors 
#> 
#> 15 samples
#>  2 predictor
#> 
#> Pre-processing: scaled (2) 
#> Resampling: Cross-Validated (3 fold, repeated 1 times) 
#> Summary of sample sizes: 11, 10, 9 
#> Resampling results across tuning parameters:
#> 
#>   k  RMSE      Rsquared   MAE     
#>   1  4.979826  0.4332661  3.998205
#>   2  5.347236  0.3970251  4.312809
#>   3  5.016606  0.5977683  3.939470
#>   4  4.504474  0.8060368  3.662623
#>   5  5.612582  0.5104171  4.500768
#> 
#> RMSE was used to select the optimal model using the smallest value.
#> The final value used for the model was k = 4.

# or

set.seed(356)
train(X1 ~ .,
      method = "knn",
      tuneGrid = expand.grid(k = 1:5),
      trControl = train.ct,
      preProcess = c('pca', 'scale'),
      metric = "RMSE",
      data = train1)
#> k-Nearest Neighbors 
#> 
#> 15 samples
#>  5 predictor
#> 
#> Pre-processing: principal component signal extraction (5), scaled
#>  (5), centered (5) 
#> Resampling: Cross-Validated (3 fold, repeated 1 times) 
#> Summary of sample sizes: 11, 10, 9 
#> Resampling results across tuning parameters:
#> 
#>   k  RMSE       Rsquared   MAE      
#>   1  13.373189  0.2450736  10.592047
#>   2  10.217517  0.2952671   7.973258
#>   3   9.030618  0.2727458   7.639545
#>   4   8.133807  0.1813067   6.445518
#>   5   8.083650  0.2771067   6.551053
#> 
#> RMSE was used to select the optimal model using the smallest value.
#> The final value used for the model was k = 5.
Created on 2019-04-15 by the reprex package (v0.2.1)
These results look worse in terms of RMSE, but the previous run underestimated the RMSE because it assumed that the PCA scores have no variation.
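A related point, not covered in the answer above: if you ever project the held-out rows manually instead of letting caret's "pca" preProcess option handle it, the test rows must be rotated with the centering, scaling, and loadings learned on the training rows, via predict() on the prcomp object. Re-running prcomp() on the test rows would produce scores that are not comparable to the training scores. A minimal base-R sketch using the same simulated data as the question:

```r
set.seed(55)
sample <- data.frame(cbind(rnorm(20, 100, 10),
                           matrix(rnorm(100, 10, 2), nrow = 20)))
train_x <- sample[1:15, 2:6]   # predictors only, as in the question
test_x  <- sample[16:20, 2:6]

# Fit the PCA on the training rows only
pcom <- prcomp(train_x, scale. = TRUE)

# Correct: apply the training-set rotation/centering/scaling to the test rows
test_scores <- predict(pcom, newdata = test_x)[, 1:2]
dim(test_scores)   # 5 rows, 2 principal components
```

These test-set scores live in the same coordinate system as `pcom$x`, so a knn model fit on the first two training-set components can score them directly.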