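I have been experimenting with the MNIST digit recognition dataset and am now somewhat stuck. I read a few research papers and implemented what I understood from them. Basically, I first created a training set and a cross-validation set to evaluate my classifiers, then ran PCA on both the test set and the training set, and then used KNN and SVM for the classification task.

My main question is whether I should run PCA on the whole dataset first and then split it into training and cross-validation sets, or split first and then run PCA separately on the cross-validation set and the training set.

I apologize for asking about something I have already tried, because I have tried both. In the first case my classifiers perform extremely well; I suspect this is because PCA used the test data when computing the principal components, which skews my results and probably biases the model. In the other case the accuracy is only about 20% to 30%, which is very low. So I am stuck and do not know how to improve my model. Any help and guidance would be greatly appreciated. I have pasted my code below for reference.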
library(ggplot2)
library(e1071)
library(ElemStatLearn)
library(plyr)
library(class)

import.csv <- function(filename){
  return(read.csv(filename, sep = ",", header = TRUE, stringsAsFactors = FALSE))
}

train.data <- import.csv("train.csv")
test.data <- train.data[30001:32000,]
train.data <- train.data[1:6000,]

#Performing PCA on the dataset to reduce the dimensionality of the data
get_PCA <- function(dataset){
  dataset.features <- dataset[,!(colnames(dataset) %in% c("label"))]
  features.unit.variance <- names(dataset[, sapply(dataset, function(v) var(v, na.rm=TRUE)==0)])
  dataset.features <- dataset[,!(colnames(dataset) %in% features.unit.variance)]
  pr.comp <- prcomp(dataset.features, retx = T, center = T, scale = T)
  #finding the total variance contained in the principal components
  prin_comp <- summary(pr.comp)
  prin_comp.sdev <- data.frame(prin_comp$sdev)
  #print(paste0("%age of variance contained = ", sum(prin_comp.sdev[1:500,])/sum(prin_comp.sdev)))
  screeplot(pr.comp, type = "lines", main = "Principal Components")
  num.of.comp = 50
  red.dataset <- prin_comp$x
  red.dataset <- red.dataset[,1:num.of.comp]
  red.dataset <- data.frame(red.dataset)
  return(red.dataset)
}

#Perform k-fold cross validation
do_cv_class <- function(df, k, classifier){
  num_of_nn = gsub("[^[:digit:]]","",classifier)
  classifier = gsub("[[:digit:]]","",classifier)
  if(num_of_nn == "") {
    classifier = c("get_pred_",classifier)
  } else {
    classifier = c("get_pred_k",classifier)
    num_of_nn = as.numeric(num_of_nn)
  }
  classifier = paste(classifier,collapse = "")
  func_name <- classifier
  output = vector()
  size_distr = c()
  n = nrow(df)
  for(i in 1:n) {
    a = 1 + (((i-1) * n)%/%k)
    b = ((i*n)%/%k)
    size_distr = append(size_distr, b - a + 1)
  }
  row_num = 1:n
  sampling = list()
  for(i in 1:k) {
    s = sample(row_num,size_distr)
    sampling[[i]] = s
    row_num = setdiff(row_num,s)
  }
  prediction.df = data.frame()
  outcome.list = list()
  for(i in 1:k) {
    testSample = sampling[[i]]
    train_set = df[-testSample,]
    test_set = df[testSample,]
    if(num_of_nn == "") {
      classifier = match.fun(classifier)
      result = classifier(train_set,test_set)
      confusion.matrix <- table(pred = result, true = test_set$label)
      accuracy <- sum(diag(confusion.matrix)*100)/sum(confusion.matrix)
      print(confusion.matrix)
      outcome <- list(sample_ID = i, Accuracy = accuracy)
      outcome.list <- rbind(outcome.list, outcome)
    } else {
      classifier = match.fun(classifier)
      result = classifier(train_set,test_set)
      print(class(result))
      confusion.matrix <- table(pred = result, true = test_set$label)
      accuracy <- sum(diag(confusion.matrix)*100)/sum(confusion.matrix)
      print(confusion.matrix)
      outcome <- list(sample_ID = i, Accuracy = accuracy)
      outcome.list <- rbind(outcome.list, outcome)
    }
  }
  return(outcome.list)
}

#Support Vector Machines with linear kernel
get_pred_svm <- function(train, test){
  digit.class.train <- as.factor(train$label)
  train.features <- train[,-train$label]
  test.features <- test[,-test$label]
  svm.model <- svm(train.features, digit.class.train, cost = 10, gamma = 0.0001, kernel = "radial")
  svm.pred <- predict(svm.model, test.features)
  return(svm.pred)
}

#KNN model
get_pred_knn <- function(train,test){
  digit.class.train <- as.factor(train$label)
  train.features <- train[,!colnames(train) %in% "label"]
  test.features <- test[,!colnames(train) %in% "label"]
  knn.model <- knn(train.features, test.features, digit.class.train)
  return(knn.model)
}
========================================================================
Answer:
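Think of PCA as a transformation that you apply to your data. You want to maintain two things:
- Since the test set simulates the "real-world" situation in which you receive samples you have not seen before, you cannot use the test set for anything other than evaluating the classifier.
- You need to apply the same transformation to all samples.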
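Therefore, you need to fit PCA on the training set and keep the transformation data, which consists of two pieces of information:
- The mean that you subtracted from the samples in order to center them.
- The transformation matrix, i.e. the eigenvectors of the covariance matrix.
Then apply that same transformation to the test set.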
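
As a rough illustration of the procedure described above, here is a minimal R sketch. It uses small synthetic data rather than MNIST, and every variable name (train.x, test.x, num.comp and so on) is an assumption made for this example, not part of the original code. Because the question's prcomp call also scales the features, the per-feature scaling factors estimated on the training set are a third piece of information to keep; prcomp stores them alongside the means and the rotation matrix, and predict() reuses all three on new data.

```r
# Minimal sketch on synthetic data (not MNIST); all names are illustrative.
set.seed(1)
n <- 200
p <- 20
x <- matrix(rnorm(n * p), nrow = n, ncol = p)
colnames(x) <- paste0("px", seq_len(p))

# Split first, before any PCA is computed.
train.idx <- sample(n, 150)
train.x <- x[train.idx, ]
test.x <- x[-train.idx, ]

# Drop columns that are constant in the training set, since they cannot
# be scaled to unit variance. Use the same column selection for the test set.
keep <- apply(train.x, 2, var) > 0
train.x <- train.x[, keep, drop = FALSE]
test.x <- test.x[, keep, drop = FALSE]

# Fit PCA on the training set only.
pca <- prcomp(train.x, center = TRUE, scale. = TRUE)

num.comp <- 5  # arbitrary choice for this sketch
train.red <- pca$x[, 1:num.comp]

# predict() applies the stored training means, scaling factors and
# rotation matrix to the test set; nothing is re-estimated from test data.
test.red <- predict(pca, newdata = test.x)[, 1:num.comp]

# The same projection written out by hand, for comparison.
manual <- scale(test.x, center = pca$center, scale = pca$scale) %*%
  pca$rotation[, 1:num.comp]
```

The reduced matrices train.red and test.red can then be passed to the classifier together with the labels. In k-fold cross-validation the same pattern would be repeated inside each fold, fitting PCA on that fold's training part only.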