Title: knn + pca, undefined columns selected

I am trying to use knn for prediction, but first I want to run a principal component analysis to reduce the dimensionality.

However, after I generate the principal components and pass them to knn, I get an error saying

“Error in [.data.frame(data, , all.vars(Terms), drop = FALSE) :
undefined columns selected”

and a warning:

“In addition: Warning message: In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, : There were missing values in resampled performance measures.”

Here is my sample:

sample = cbind(rnorm(20, 100, 10), matrix(rnorm(100, 10, 2), nrow = 20)) %>%  data.frame()

The first 15 rows are used as the training set

train1 = sample[1:15, ]
test = sample[16:20, ]

Remove the dependent variable

pca.tr=sample[1:15,2:6]
pcom = prcomp(pca.tr, scale.=T)
pca.tr=data.frame(True=train1[,1], pcom$x)
# select the first two principal components
pca.tr = pca.tr[, 1:2]
train.ct = trainControl(method = "repeatedcv", number = 3, repeats=1)
k = train(train1[,1] ~ .,
          method     = "knn",
          tuneGrid   = expand.grid(k = 1:5),
          trControl  = train.control, preProcess='scale',
          metric     = "RMSE",
          data       = cbind(train1[,1], pca.tr))

Any suggestions would be greatly appreciated!


Answer:

Use better column names and a formula without subscripts.
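As a rough illustration of why the subscripted formula fails, here is a sketch in base R. It assumes the model frame is built by selecting `all.vars()` of the formula from `data`, which is what the error message suggests; the toy data frame `d` is made up for the example.

```r
# A formula is not evaluated when it is created, so train1 need not exist here.
f_bad  <- train1[, 1] ~ PC1
f_good <- y ~ PC1

all.vars(f_bad)   # includes "train1", which is not a column of the data
all.vars(f_good)  # only names that are columns of the data

# Hypothetical toy data with the columns the good formula expects.
d <- data.frame(y = c(1, 2, 3), PC1 = c(0.1, 0.2, 0.3))

# Selecting by variable name works for the plain formula ...
ok <- d[, all.vars(f_good), drop = FALSE]

# ... but fails for the subscripted one, because there is no "train1" column.
err <- tryCatch(d[, all.vars(f_bad), drop = FALSE], error = function(e) e)
inherits(err, "error")
```

Naming the outcome column (for example `y`) and writing `y ~ .` avoids the lookup of a column that does not exist.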

You really should try to post a reproducible example. Some of your code is wrong.

Also, preProc has a "pca" method that does the appropriate thing by recomputing the PCA scores inside of resampling.

library(caret)
#> Loading required package: lattice
#> Loading required package: ggplot2
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

set.seed(55)
sample = cbind(rnorm(20, 100, 10), matrix(rnorm(100, 10, 2), nrow = 20)) %>%
  data.frame()

train1 = sample[1:15, ]
test = sample[16:20, ]

pca.tr=sample[1:15,2:6]
pcom = prcomp(pca.tr, scale.=T)
pca.tr=data.frame(True=train1[,1], pcom$x)
# select the first two principal components
pca.tr = pca.tr[, 1:2]

dat <- cbind(train1[,1], pca.tr) %>% 
  # This
  setNames(c("y", "True", "PC1"))

train.ct = trainControl(method = "repeatedcv", number = 3, repeats=1)

set.seed(356)
k = train(y ~ .,
          method     = "knn",
          tuneGrid   = expand.grid(k = 1:5),
          trControl  = train.ct, # this argument was wrong in your code
          preProcess='scale',
          metric     = "RMSE",
          data       = dat)
k
#> k-Nearest Neighbors 
#> 
#> 15 samples
#>  2 predictor
#> 
#> Pre-processing: scaled (2) 
#> Resampling: Cross-Validated (3 fold, repeated 1 times) 
#> Summary of sample sizes: 11, 10, 9 
#> Resampling results across tuning parameters:
#> 
#>   k  RMSE      Rsquared   MAE     
#>   1  4.979826  0.4332661  3.998205
#>   2  5.347236  0.3970251  4.312809
#>   3  5.016606  0.5977683  3.939470
#>   4  4.504474  0.8060368  3.662623
#>   5  5.612582  0.5104171  4.500768
#> 
#> RMSE was used to select the optimal model using the smallest value.
#> The final value used for the model was k = 4.

# or 
set.seed(356)
train(X1 ~ .,
      method     = "knn",
      tuneGrid   = expand.grid(k = 1:5),
      trControl  = train.ct, 
      preProcess= c('pca', 'scale'),
      metric     = "RMSE",
      data       = train1)
#> k-Nearest Neighbors 
#> 
#> 15 samples
#>  5 predictor
#> 
#> Pre-processing: principal component signal extraction (5), scaled
#>  (5), centered (5) 
#> Resampling: Cross-Validated (3 fold, repeated 1 times) 
#> Summary of sample sizes: 11, 10, 9 
#> Resampling results across tuning parameters:
#> 
#>   k  RMSE       Rsquared   MAE      
#>   1  13.373189  0.2450736  10.592047
#>   2  10.217517  0.2952671   7.973258
#>   3   9.030618  0.2727458   7.639545
#>   4   8.133807  0.1813067   6.445518
#>   5   8.083650  0.2771067   6.551053
#> 
#> RMSE was used to select the optimal model using the smallest value.
#> The final value used for the model was k = 5.

Created on 2019-04-15 by the reprex package (v0.2.1)

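These results look worse in terms of RMSE, but the previous run underestimates RMSE because it assumes that there is no variation in the PCA scores.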
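To make the point about variation in the PCA scores concrete, here is a minimal base R sketch of what recomputing PCA inside resampling amounts to for a single split. It is an illustration under assumptions, not caret's internal code: the names `analysis`, `holdout`, and `pca_fit` are made up, and the split is a fixed 15/5 split rather than a cross-validation fold.

```r
set.seed(55)
# Same shape as the question's data: one outcome plus five predictors (X1..X5).
dat <- data.frame(y = rnorm(20, 100, 10), matrix(rnorm(100, 10, 2), nrow = 20))

analysis <- dat[1:15, ]   # rows the model is fitted on
holdout  <- dat[16:20, ]  # rows held out for assessment

# Fit the PCA on the analysis rows only (outcome column excluded).
pca_fit <- prcomp(analysis[, -1], center = TRUE, scale. = TRUE)

# Scores for the analysis rows, keeping the first two components.
analysis_pcs <- pca_fit$x[, 1:2]

# Project the held-out rows with the centering, scaling, and rotation
# estimated from the analysis rows, without refitting the PCA.
holdout_pcs <- predict(pca_fit, newdata = holdout[, -1])[, 1:2]
```

Repeating this for every resample means the rotation is re-estimated each time, so the uncertainty in the PCA step shows up in the resampled RMSE. The same `predict()` call is one way to get scores for the `test` rows from a PCA fitted on the training rows.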
