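Background
I am using the mlr3 package in R for modeling and prediction. I am working with a large data set that contains both the test and the train set. Test and train rows are marked by an indicator column (test_or_train in the code).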
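Goal
- Batch train all learners on the train rows indicated by the test_or_train column in the data set.
- Batch predict the rows marked "test" in the test_or_train column with the respective trained learner.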
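Code
- A placeholder data set with a test-train indicator column. (In the actual data the train-test split is not artificial.)
- Two tasks. (In the actual code the tasks differ and there are more of them.)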
    library(readr)
    library(mlr3)
    library(mlr3learners)
    library(mlr3pipelines)
    library(reprex)
    library(caret)

    # Data
    urlfile = 'https://raw.githubusercontent.com/shudras/office_data/master/office_data.csv'
    data = read_csv(url(urlfile))[-1]

    ## Create artificial partition to test and train sets
    art_part = createDataPartition(data$imdb_rating, list=FALSE)
    train = data[art_part,]
    test = data[-art_part,]

    ## Add test-train indicators
    train$test_or_train = 'train'
    test$test_or_train = 'test'

    ## Data set that I want to work / am working with
    data = rbind(test, train)

    # Create two tasks (Here the tasks are the same but in my data set they differ.)
    task1 = TaskRegr$new(
      id = 'office1',
      backend = data,
      target = 'imdb_rating'
    )
    task2 = TaskRegr$new(
      id = 'office2',
      backend = data,
      target = 'imdb_rating'
    )

    # Model specification
    graph = po('scale') %>>%
      lrn('regr.cv_glmnet',
        id = 'rp',
        alpha = 1,
        family = 'gaussian'
      )

    # Learner creation
    learner = GraphLearner$new(graph)

    # Goal
    ## 1. Batch train all learners with the train rows indicated by the test_or_train column in the data set
    ## 2. Batch predict the rows designated by the 'test' in the test_or_train column with the respective trained learner
Created on 2020-06-22 by the reprex package (v0.3.0)
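Note
I tried using benchmark_grid with row_ids to train the learners on the train rows only, but this did not work, and working with a column indicator is also much easier than working with row indices. With the test-train column indicator a single rule (for splitting) is enough, whereas row indices only work if the tasks contain the same rows.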
    benchmark_grid(
      tasks = list(task1, task2),
      learners = learner,
      row_ids = train_rows # Not an argument and not favorable to work with indices
    )
Answer:
You can use benchmark with a custom design.
The following should do the job (note that I instantiate a custom Resampling for each Task separately).
    library(data.table)
    design = data.table(
      task = list(task1, task2),
      learner = list(learner)
    )

    library(mlr3misc)
    design$resampling = map(design$task, function(x) {
      # get train/test split
      split = x$data()[["test_or_train"]]
      # remove train-test split column from the task
      x$select(setdiff(x$feature_names, "test_or_train"))
      # instantiate a custom resampling with the given split
      rsmp("custom")$instantiate(x,
        train_sets = list(which(split == "train")),
        test_sets = list(which(split == "test"))
      )
    })

    benchmark(design)
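If you also want to look at the predictions for the rows marked "test", a minimal self-contained sketch could look like the one below. It is not your actual setup: it assumes the built-in mtcars data as placeholder data, a made-up 24/8 split in the test_or_train column, and the featureless regression learner standing in for your GraphLearner so that it runs with mlr3 alone.

```r
library(mlr3)
library(data.table)

# Placeholder data (assumption): mtcars with an artificial indicator column
dat = as.data.table(mtcars)
dat$test_or_train = rep(c("train", "test"), times = c(24, 8))

task = TaskRegr$new(id = "cars", backend = dat, target = "mpg")

# Read the split from the indicator column, then drop the column from the features
split = task$data(cols = "test_or_train")[["test_or_train"]]
task$select(setdiff(task$feature_names, "test_or_train"))

# Custom resampling with exactly one train set and one test set
resampling = rsmp("custom")$instantiate(task,
  train_sets = list(which(split == "train")),
  test_sets = list(which(split == "test"))
)

# Stand-in learner (assumption): replace with your own learner
design = data.table(
  task = list(task),
  learner = list(lrn("regr.featureless")),
  resampling = list(resampling)
)
bmr = benchmark(design)

# Predictions for the rows marked "test" (columns: row_ids, truth, response)
pred = as.data.table(bmr$resample_result(1)$prediction())
print(pred)

# Aggregated performance for each task/learner combination
print(bmr$aggregate(msr("regr.mse")))
```

With several tasks in the design, bmr$resample_result(i) gives the result for the i-th row of the design, so the same two calls can be repeated per task.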
Could you be more specific about what you mean by batch-processing? Or does this answer your question?