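Overview:
I used the tidymodels package and the data frame FID (see below) to produce four models:
- General linear model
- Bagged tree
- Random forest
- Boosted tree
The data frame contains three predictors:
- Year (numeric)
- Month (factor)
- Days (numeric)
The dependent variable is Frequency (numeric).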
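Problem
I get the error message below when trying to fit the bagged tree model.
Why am I getting an error when using bag_tree() with fit_resamples()?
There is very little material about this online. I only found this post; however, that question concerns logistic regression rather than a bagged tree model.
x Fold01: model: Error: Input must be a vector, not NULL.
x Fold02: model: Error: Input must be a vector, not NULL.
x Fold03: model: Error: Input must be a vector, not NULL.
x Fold04: model: Error: Input must be a vector, not NULL.
x Fold05: model: Error: Input must be a vector, not NULL.
x Fold06: model: Error: Input must be a vector, not NULL.
x Fold07: model: Error: Input must be a vector, not NULL.
x Fold08: model: Error: Input must be a vector, not NULL.
x Fold09: model: Error: Input must be a vector, not NULL.
x Fold10: model: Error: Input must be a vector, not NULL.
Warning message:
All models failed in [fit_resamples()]. See the `.notes` column.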
If anyone can help with this error message, I would be very grateful for your advice.
Thanks in advance.
R code
## Load tidymodels and the supporting packages
library(tidymodels)
library(glmnet)
library(parsnip)
library(rpart.plot)
library(rpart)
library(tidyverse)  # data manipulation
library(skimr)      # data summaries
library(baguette)   # bagged trees
library(future)     # parallel processing, to reduce computation time
library(xgboost)    # boosted trees
library(ranger)
library(yardstick)
library(purrr)
library(forcats)
library(rlang)
library(poissonreg)

# Split the single data set into two: a training set and a test set
data_split <- initial_split(FID)

# Create data frames for the two sets:
train_data <- training(data_split)
test_data  <- testing(data_split)

# Resample the training data with 10-fold cross-validation (10 folds is the default)
cv <- vfold_cv(train_data, v = 10)

###########################################################
## Build the recipe
rec <- recipe(Frequency ~ ., data = FID) %>%
  step_nzv(all_predictors(), freq_cut = 0, unique_cut = 0) %>% # remove zero-variance variables
  step_novel(all_nominal()) %>% # prepare test data to handle previously unseen factor levels
  step_medianimpute(all_numeric(), -all_outcomes(), -has_role("id vars")) %>% # replace missing numeric observations with the median
  step_dummy(all_nominal(), -has_role("id vars")) # dummy-code categorical variables

##### Bagged tree
mod_bag <- bag_tree() %>%
  set_mode("regression") %>%
  set_engine("rpart", times = 10) # 10 bootstrap resamples

## Update the model to include cost complexity
## (a positive number for the cost/complexity parameter)
Updated_bag <- update(mod_bag, cost_complexity = 1)

## Create the workflow
wflow_bag <- workflow() %>%
  add_recipe(rec) %>%
  add_model(Updated_bag)

## Fit the bagged tree workflow on the training data
bag_fit_model <- fit(wflow_bag, data = train_data)

## We can access the fit with pull_workflow_fit(), and even
## use tidy() to arrange the model coefficients into a convenient data frame.
bag_fit_model %>%
  pull_workflow_fit()

## Predict with the fitted model
bag_predict <- predict(bag_fit_model, train_data)

## Fit the model on the resamples
plan(multisession)
fit_bag <- fit_resamples(
  wflow_bag,
  cv,
  metrics = metric_set(rmse, rsq),
  control = control_resamples(save_pred = TRUE,
                              extract = function(x) extract_model(x))
)
# All ten folds (Fold01 to Fold10) fail here with the error and warning quoted earlier in this post.
Data frame: FID
structure(list(Year = c(2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017), Month = structure(c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L), .Label = c("January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"), class = "factor"), Frequency = c(36, 28, 39, 46, 5, 0, 0, 22, 10, 15, 8, 33, 33, 29, 31, 23, 8, 9, 7, 40, 41, 41, 30, 30, 44, 37, 41, 42, 20, 0, 7, 27, 35, 27, 43, 38), Days = c(31, 28, 31, 30, 6, 0, 0, 29, 15, 29, 29, 31, 31, 29, 30, 30, 7, 0, 7, 30, 30, 31, 30, 27, 31, 28, 30, 30, 21, 0, 7, 26, 29, 27, 29, 29)), row.names = c(NA, -36L), class = "data.frame")
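Answer:
The cost_complexity for decision trees is sometimes called alpha, and it should be a positive number less than one. When you set cost_complexity to a value less than one, your model runs fine: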
library(tidymodels)
library(baguette)

FID <- structure(list(
  Year = c(2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015,
           2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016,
           2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017),
  Month = structure(c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L,
                      1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L,
                      1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L),
                    .Label = c("January", "February", "March", "April", "May", "June",
                               "July", "August", "September", "October", "November", "December"),
                    class = "factor"),
  Frequency = c(36, 28, 39, 46, 5, 0, 0, 22, 10, 15, 8, 33,
                33, 29, 31, 23, 8, 9, 7, 40, 41, 41, 30, 30,
                44, 37, 41, 42, 20, 0, 7, 27, 35, 27, 43, 38),
  Days = c(31, 28, 31, 30, 6, 0, 0, 29, 15, 29, 29, 31,
           31, 29, 30, 30, 7, 0, 7, 30, 30, 31, 30, 27,
           31, 28, 30, 30, 21, 0, 7, 26, 29, 27, 29, 29)),
  row.names = c(NA, -36L), class = "data.frame")

# Split the single data set into two: a training set and a test set
data_split <- initial_split(FID)

# Create data frames for the two sets:
train_data <- training(data_split)
test_data  <- testing(data_split)

# Resample the training data with 10-fold cross-validation (10 folds is the default)
cv <- vfold_cv(train_data, v = 10)

rec <- recipe(Frequency ~ ., data = FID) %>%
  step_nzv(all_predictors(), freq_cut = 0, unique_cut = 0) %>% # remove zero-variance variables
  step_novel(all_nominal()) %>% # prepare test data to handle previously unseen factor levels
  step_medianimpute(all_numeric(), -all_outcomes(), -has_role("id vars")) %>% # replace missing numeric observations with the median
  step_dummy(all_nominal(), -has_role("id vars")) # dummy-code categorical variables

mod_bag <- bag_tree(cost_complexity = 0.1) %>%
  set_mode("regression") %>%
  set_engine("rpart", times = 10) # 10 bootstrap resamples

wflow_bag <- workflow() %>%
  add_recipe(rec) %>%
  add_model(mod_bag)

fit(wflow_bag, data = train_data)
#> ══ Workflow [trained] ══════════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: bag_tree()
#> 
#> ── Preprocessor ────────────────────────────────────────────────────────────────
#> 4 Recipe Steps
#> 
#> ● step_nzv()
#> ● step_novel()
#> ● step_medianimpute()
#> ● step_dummy()
#> 
#> ── Model ───────────────────────────────────────────────────────────────────────
#> Bagged CART (regression with 10 members)
#> 
#> Variable importance scores include:
#> 
#> # A tibble: 12 x 4
#>    term             value std.error  used
#>    <chr>            <dbl>     <dbl> <int>
#>  1 Days            4922.      369.     10
#>  2 Month_June      2253.      260.      9
#>  3 Month_July      1375.      139.      8
#>  4 Month_November   306.       96.4     3
#>  5 Year             272.      519.      2
#>  6 Month_May        270.      103.      4
#>  7 Month_February   191.      116.      4
#>  8 Month_August     105.       30.2     3
#>  9 Month_April       45.8      42.5     2
#> 10 Month_September   13.4       0       1
#> 11 Month_December    11.9       0       1
#> 12 Month_March       10.1       0       1
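Created on 2020-12-17 by the reprex package (v0.3.0.9001)
I am guessing you tried a value of 1 because the documentation shows that value here, which is quite misleading. We will fix that.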
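To see why a value of 1 is too aggressive, here is a minimal sketch that calls rpart() directly. It rests on two assumptions that are not part of the original post: that cost_complexity is handed to the rpart engine as its cp control parameter, and that the built-in mtcars data is an acceptable stand-in for FID. It does not prove that this is what triggers the "Input must be a vector, not NULL" error inside fit_resamples(); it only shows what cp = 1 does to a single tree.

```r
library(rpart)

# Illustration only: mtcars stands in for the FID data, and calling rpart()
# directly is an assumption about what the "rpart" engine does underneath.
fit_cp_one   <- rpart(mpg ~ ., data = mtcars, control = rpart.control(cp = 1))
fit_cp_small <- rpart(mpg ~ ., data = mtcars, control = rpart.control(cp = 0.01))

# Each row of $frame is one node of the tree, so a single row means the
# tree consists of the root only and never split.
nrow(fit_cp_one$frame)
nrow(fit_cp_small$frame)

# Check whether the tree fitted with cp = 1 carries any variable importance.
is.null(fit_cp_one$variable.importance)
```

With cp = 1 a split is kept only if it removes essentially all of the remaining error, so the tree stays at its root. Ten bagged copies of a root-only tree have nothing useful to predict with or to summarise, which is consistent with the advice above to keep cost_complexity well below one.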