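I am trying to one-hot encode a subset of the columns of a data frame in R.
One-hot encoding converts categorical variables into a form that machine learning algorithms can use to make better predictions: each string column is replaced by one binary column for every distinct string it contains.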
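To make the idea concrete, here is a minimal sketch using base R's model.matrix(). The toy data frame and its color column are made up for illustration and are not the data from this question.

```r
# Hypothetical toy data, not the data frame from this question
toy <- data.frame(color = c("red", "blue", "red"))

# model.matrix() builds one indicator column per distinct value;
# "- 1" drops the intercept so that every value gets its own column
encoded <- model.matrix(~ color - 1, data = toy)

print(encoded)
```

Each row has a 1 in the column that matches its value and a 0 everywhere else.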
Suppose we have a data frame like this:
    mes         work_location  birth_place
    01/01/2000  China          Chile
    01/02/2000  Mexico         Japan
    01/03/2000  China          Chile
    01/04/2000  China          Argentina
    01/05/2000  USA            Poland
    01/06/2000  Mexico         Poland
    01/07/2000  USA            Finland
    01/08/2000  USA            Finland
    01/09/2000  Japan          Norway
    01/10/2000  Japan          Kenia
    01/11/2000  Japan          Mali
    01/12/2000  India          Mali
Here is the code I use for the one-hot encoding:
    ## One-hot encoding function ##
    columna_dummy <- function(df, columna) {
      df %>%
        mutate_at(columna, ~paste(columna, eval(as.symbol(columna)), sep = "_")) %>%
        mutate(valor = 1) %>%
        spread(key = columna, value = valor, fill = 0)
    }

    ## Select the columns ##
    columnas <- c("work_location", "birth_place")

    ## Loop over the data frame columns to repeat the columna_dummy function ##
    for(i in 1:length(columnas)){
      new_dataset <- columna_dummy(df, i)
    }
Console output:
    Error: Problem with `mutate()` input `mes`.
    x objeto '1' no encontrado
    i Input `mes` is `(structure(function (..., .x = ..1, .y = ..2, . = ..1) ...`.
    Run `rlang::last_error()` to see where the error occurred.
    Called from: signal_abort(cnd)
The column mes is a date column. It is not included in the atomic vector of columns (columnas), yet it still triggers the error above.
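The expected output should look roughly like the table below, with one new column for each distinct string in the selected string columns of the data frame.
(I cannot show every column here, but work_location_China is an example of how the new columns should look.)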
    mes         work_location  birth_place  work_location_China
    01/01/2000  China          Chile        1
    01/02/2000  Mexico         Japan        0
    01/03/2000  China          Chile        1
    01/04/2000  China          Argentina    1
    01/05/2000  USA            Poland       0
    01/06/2000  Mexico         Poland       0
    01/07/2000  USA            Finland      0
    01/08/2000  USA            Finland      0
    01/09/2000  Japan          Norway       0
    01/10/2000  Japan          Kenia        0
    01/11/2000  Japan          Mali         0
    01/12/2000  India          Mali         0
Is there another way to apply this loop?
Answer:
I solved the problem by using the purrr library:
    ## Packages providing %>%, mutate_at(), inner_join() and spread() ##
    library(dplyr)
    library(tidyr)

    ## Data ##
    df <- structure(list(
      mes = c("01/01/2000", "01/02/2000", "01/03/2000", "01/04/2000",
              "01/05/2000", "01/06/2000", "01/07/2000", "01/08/2000",
              "01/09/2000", "01/10/2000", "01/11/2000", "01/12/2000"),
      work_location = c("China", "Mexico", "China", "China", "USA", "Mexico",
                        "USA", "USA", "Japan", "Japan", "Japan", "India"),
      birth_place = c("Chile", "Japan", "Chile", "Argentina", "Poland", "Poland",
                      "Finland", "Finland", "Norway", "Kenia", "Mali", "Mali")),
      class = "data.frame", row.names = c(NA, -12L))

    ## One-hot encoding function ##
    columna_dummy <- function(df, columna) {
      df %>%
        mutate_at(columna, ~paste(columna, eval(as.symbol(columna)), sep = "_")) %>%
        mutate(valor = 1) %>%
        spread(key = columna, value = valor, fill = 0)
    }

    ## Vector of columns ##
    columnas <- c("work_location", "birth_place")

    ## One-hot encoded dataset ##
    library(purrr)
    # Encode each column separately, then join the results on their shared columns
    hot_encoded_dataset <- purrr::map(columnas, columna_dummy, df = df) %>%
      reduce(inner_join)
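For reference, the original for loop most likely failed because it passed the index i instead of the column name columnas[i]. mutate_at() then selected the first column by position (mes), and as.symbol(1) looked up an object named '1', which matches the error message. The loop also overwrote new_dataset on every iteration instead of accumulating the new columns. Below is a sketch of a loop-based alternative in base R, assuming a 0/1 column named <column>_<value> is wanted for each distinct value. The helper name add_dummies is made up for illustration, and only the first four rows of the data are used to keep it short.

```r
# First rows of the data from the question, to keep the sketch short
df <- data.frame(
  mes = c("01/01/2000", "01/02/2000", "01/03/2000", "01/04/2000"),
  work_location = c("China", "Mexico", "China", "China"),
  birth_place = c("Chile", "Japan", "Chile", "Argentina")
)

columnas <- c("work_location", "birth_place")

# add_dummies() is a hypothetical helper name, not a function from any package
add_dummies <- function(data, column) {
  for (value in unique(data[[column]])) {
    # One 0/1 indicator column per distinct value, named <column>_<value>
    data[[paste(column, value, sep = "_")]] <- as.integer(data[[column]] == value)
  }
  data
}

new_dataset <- df
for (columna in columnas) {
  # Pass the column name itself and keep accumulating into new_dataset
  new_dataset <- add_dummies(new_dataset, columna)
}

print(new_dataset)
```

Unlike the spread()-based function, this version keeps the original work_location and birth_place columns next to the new indicator columns, which is the layout shown in the expected output.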