I am trying to use knn for prediction, but first I want to run principal component analysis to reduce the dimensionality. However, after I generated the principal components and applied them to knn, I got the error:
“Error in [.data.frame(data, , all.vars(Terms), drop = FALSE) :
undefined columns selected”
along with the warning:
“In addition: Warning message: In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, : There were missing values in resampled performance measures.”
Here is my sample:
sample = cbind(rnorm(20, 100, 10), matrix(rnorm(100, 10, 2), nrow = 20)) %>% data.frame()
The first 15 rows are used for the training set:
train1 = sample[1:15, ]
test = sample[16:20, ]
Remove the dependent variable:
pca.tr = sample[1:15, 2:6]
pcom = prcomp(pca.tr, scale. = T)
pca.tr = data.frame(True = train1[, 1], pcom$x)
# select the first two principal components
pca.tr = pca.tr[, 1:2]
train.ct = trainControl(method = "repeatedcv", number = 3, repeats = 1)
k = train(train1[,1] ~ .,
          method = "knn",
          tuneGrid = expand.grid(k = 1:5),
          trControl = train.control,
          preProcess = 'scale',
          metric = "RMSE",
          data = cbind(train1[,1], pca.tr))
Any advice would be greatly appreciated!
Answer:
Use better column names and a formula without subscripting. You should really try to post a reproducible example; parts of your code are wrong. Also, preProc has a "pca" method that recomputes the PCA scores inside of resampling, which does the appropriate thing.
library(caret)
#> Loading required package: lattice
#> Loading required package: ggplot2
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
set.seed(55)
sample = cbind(rnorm(20, 100, 10), matrix(rnorm(100, 10, 2), nrow = 20)) %>% data.frame()
train1 = sample[1:15, ]
test = sample[16:20, ]
pca.tr = sample[1:15, 2:6]
pcom = prcomp(pca.tr, scale. = T)
pca.tr = data.frame(True = train1[, 1], pcom$x)
# select the first two principal components
pca.tr = pca.tr[, 1:2]

dat <- cbind(train1[, 1], pca.tr) %>% # This
  setNames(c("y", "True", "PC1"))

train.ct = trainControl(method = "repeatedcv", number = 3, repeats = 1)

set.seed(356)
k = train(y ~ .,
          method = "knn",
          tuneGrid = expand.grid(k = 1:5),
          trControl = train.ct, # this argument was wrong in your code
          preProcess = 'scale',
          metric = "RMSE",
          data = dat)
k
#> k-Nearest Neighbors 
#> 
#> 15 samples
#>  2 predictor
#> 
#> Pre-processing: scaled (2) 
#> Resampling: Cross-Validated (3 fold, repeated 1 times) 
#> Summary of sample sizes: 11, 10, 9 
#> Resampling results across tuning parameters:
#> 
#>   k  RMSE      Rsquared   MAE     
#>   1  4.979826  0.4332661  3.998205
#>   2  5.347236  0.3970251  4.312809
#>   3  5.016606  0.5977683  3.939470
#>   4  4.504474  0.8060368  3.662623
#>   5  5.612582  0.5104171  4.500768
#> 
#> RMSE was used to select the optimal model using the smallest value.
#> The final value used for the model was k = 4.

# or

set.seed(356)
train(X1 ~ .,
      method = "knn",
      tuneGrid = expand.grid(k = 1:5),
      trControl = train.ct,
      preProcess = c('pca', 'scale'),
      metric = "RMSE",
      data = train1)
#> k-Nearest Neighbors 
#> 
#> 15 samples
#>  5 predictor
#> 
#> Pre-processing: principal component signal extraction (5), scaled
#>  (5), centered (5) 
#> Resampling: Cross-Validated (3 fold, repeated 1 times) 
#> Summary of sample sizes: 11, 10, 9 
#> Resampling results across tuning parameters:
#> 
#>   k  RMSE       Rsquared   MAE      
#>   1  13.373189  0.2450736  10.592047
#>   2  10.217517  0.2952671   7.973258
#>   3   9.030618  0.2727458   7.639545
#>   4   8.133807  0.1813067   6.445518
#>   5   8.083650  0.2771067   6.551053
#> 
#> RMSE was used to select the optimal model using the smallest value.
#> The final value used for the model was k = 5.
Created on 2019-04-15 by the reprex package (v0.2.1)
These results look worse in terms of RMSE, but the previous run underestimated the RMSE because it assumed that the PCA scores have no variation.
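A related point, not covered in the answer above: if you ever project the held-out rows manually instead of letting caret's "pca" preProcess option handle it, the test rows must be rotated with the centering, scaling, and loadings learned on the training rows, via predict() on the prcomp object. Re-running prcomp() on the test rows would produce scores that are not comparable to the training scores. A minimal base-R sketch using the same simulated data as the question:

```r
set.seed(55)
sample <- data.frame(cbind(rnorm(20, 100, 10),
                           matrix(rnorm(100, 10, 2), nrow = 20)))
train_x <- sample[1:15, 2:6]   # predictors only, as in the question
test_x  <- sample[16:20, 2:6]

# Fit the PCA on the training rows only
pcom <- prcomp(train_x, scale. = TRUE)

# Correct: apply the training-set rotation/centering/scaling to the test rows
test_scores <- predict(pcom, newdata = test_x)[, 1:2]
dim(test_scores)   # 5 rows, 2 principal components
```

These test-set scores live in the same coordinate system as `pcom$x`, so a knn model fit on the first two training-set components can score them directly.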