以下是我在这里找到的caret
包中的downSample
函数。
downSample <- function(x, y, list = FALSE, yname = "Class") { xc <- class(x) if(!is.data.frame(x)) x <- as.data.frame(x) if(!is.factor(y)) { warning("Down-sampling requires a factor variable as the response. The original data was returned.") return(list(x = x, y = y)) } minClass <- min(table(y)) x$.outcome <- y x <- ddply(x, .(y), function(dat, n) dat[sample(seq(along = dat$.outcome), n),,drop = FALSE], n = minClass) y <- x$.outcome x <- x[, !(colnames(x) %in% c("y", ".outcome")), drop = FALSE] if(list) { if(xc[1] == "matrix") x <- as.matrix(x) out <- list(x = x, y = y) } else { out <- cbind(x, y) colnames(out)[ncol(out)] <- yname } out }
假设我的数据集是iris
:
data(iris) x <- iris[, -5]y <- iris[, 5]
为了使响应变量成为非常不平衡的二元变量 :
y[-c(130, 146)] <- "setosa"
现在有两个"virginica"
实例和148个"setosa"
实例。我希望修改downSample
函数,以便最终返回的不是最小类别的50%子样本数据集,而是返回例如30%(k)的少数类别和70%的多数类别的子样本数据集。因为使用downSample
函数对于最小类别中的n个实例,它会选择n个实例的其他类别以获得一个完全平衡的数据集。但在我这里,我会丢失很多数据,所以我只想稍微平衡一下,而不是完全平衡。假设k = 20%
,即最终我想要20%的minClass
和80%的其他类别。我已经尝试修改函数的这一部分 :
x <- ddply(x, .(y), function(dat, n) dat[sample(seq(along = dat$.outcome), n),, drop = FALSE], n = minClass)
通过将n
改为4*n
,但我没有成功。出现了以下错误 :
Error in size <= n/2 :comparison (4) is possible only for atomic and list types
您的帮助将不胜感激。
回答:
一种简单的实现方法是更改ddply
调用中的n = minClass
部分。
downSample_custom <- function(x, y, list = FALSE, yname = "Class", frac = 1){ #添加参数frac,其范围在0 - 1之间 xc <- class(x) if(!is.data.frame(x)) x <- as.data.frame(x) if(!is.factor(y)) { warning("Down-sampling requires a factor variable as the response. The original data was returned.") return(list(x = x, y = y)) } minClass <- min(table(y)) x$.outcome <- y x <- ddply(x, .(y), function(dat, n) dat[sample(seq(along = dat$.outcome), n),,drop = FALSE], n = minClass*frac) #将n改为这个 y <- x$.outcome x <- x[, !(colnames(x) %in% c("y", ".outcome")), drop = FALSE] if(list) { if(xc[1] == "matrix") x <- as.matrix(x) out <- list(x = x, y = y) } else { out <- cbind(x, y) colnames(out)[ncol(out)] <- yname } out}
它是否有效:
library(plyr)
不平衡的y:
set.seed(1)y <- as.factor(sample(c("M", "F"), prob = c(0.1, 0.9), size = 10000, replace = TRUE))x <- rnorm(10000)table(downSample_custom(x, y)[,2])
输出:
F M 1044 1044 table(downSample_custom(x, y, frac = 0.5)[,2])
输出:
F M 522 522 table(downSample_custom(x, y, frac = 0.2)[,2])
输出
F M 208 208
使用frac > 1时会返回错误:
downSample_custom(x, y, frac = 2)
输出
Error in sample.int(length(x), size, replace, prob) :cannot take a sample larger than the population when ‘replace = FALSE’
编辑:对更新后的问题的回答。
例如,可以通过分别采样每个类的索引来实现这一点。以下是一个仅适用于两类问题的示例:
downSample_custom <- function(x, y, yname = "Class", frac = 1){ lev <- levels(y) minClass <- min(table(y)) lev_min <- levels(y)[which.min(table(y))] inds_down <- sample(which(y == lev[lev != lev_min]), size = minClass * frac) #根据minClass * frac采样更多类别的索引 inds_minClass <- which(y == lev[lev == lev_min]) #获取较少类别的所有索引 out <- data.frame(x, y) out <- out[sort(c(inds_down, inds_minClass)),] colnames(out)[ncol(out)] <- yname return(out)}
实践中的效果:
table(downSample_custom(x, y)[,2])
输出:
F M 1044 1044 table(downSample_custom(x, y, frac = 5)[,2])
输出:
F M 5220 1044 head(downSample_custom(x, y, frac = 5))
输出:
x Class1 -1.5163733 F2 0.6291412 F4 1.1797811 M5 1.1176545 F6 -1.2377359 F7 -1.2301645 M