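Background

I am using the mlr3 package in R for modeling and prediction. I am working with a large data set that contains both a training set and a test set. An indicator column (test_or_train in the code) marks which set each row belongs to.

Goal

Batch train all learners on the rows marked 'train' in the test_or_train column of the data set.
Batch predict the rows marked 'test' in the test_or_train column with the respective trained learner.

Code

A placeholder data set with a test-train indicator column. (In the actual data, the train-test split is not artificial.)
Two tasks. (In the actual code, the tasks are distinct and there are more of them.)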

library(readr)
library(mlr3)
library(mlr3learners)
library(mlr3pipelines)
library(reprex)
library(caret)

# Data
urlfile = 'https://raw.githubusercontent.com/shudras/office_data/master/office_data.csv'
data = read_csv(url(urlfile))[-1]

## Create artificial partition to test and train sets
art_part = createDataPartition(data$imdb_rating, list = FALSE)
train = data[art_part, ]
test = data[-art_part, ]

## Add test-train indicators
train$test_or_train = 'train'
test$test_or_train = 'test'

## Data set that I want to work / am working with
data = rbind(test, train)

# Create two tasks (Here the tasks are the same but in my data set they differ.)
task1 =
  TaskRegr$new(
    id = 'office1',
    backend = data,
    target = 'imdb_rating'
  )
task2 =
  TaskRegr$new(
    id = 'office2',
    backend = data,
    target = 'imdb_rating'
  )

# Model specification
graph =
  po('scale') %>>%
  lrn('regr.cv_glmnet',
      id = 'rp',
      alpha = 1,
      family = 'gaussian'
  )

# Learner creation
learner = GraphLearner$new(graph)

# Goal
## 1. Batch train all learners with the train rows indicated by the test_or_train column in the data set
## 2. Batch predict the rows designated by the 'test' in the test_or_train column with the respective trained learner

Created on 2020-06-22 by the reprex package (v0.3.0)

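Note

I tried using benchmark_grid with row_ids to train the learners on the training rows only, but this did not work. Working with an indicator column is also much easier than working with row indices: with the test-train indicator column, a single rule can define the split, whereas row indices only work if all tasks contain the same rows.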

benchmark_grid(
  tasks = list(task1, task2),
  learners = learner,
  row_ids = train_rows  # Not an argument and not favorable to work with indices
)
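To illustrate why a rule on the indicator column is more convenient than fixed row indices, here is a minimal base R sketch. The helper split_ids and the toy data frames d1 and d2 are hypothetical and exist only for this illustration; they are not part of mlr3 or of the original code.

```r
# Hypothetical helper: turn an indicator column into train/test row indices.
split_ids = function(df, col = "test_or_train") {
  list(
    train = which(df[[col]] == "train"),
    test  = which(df[[col]] == "test")
  )
}

# Two toy tables with different rows; the same rule works for both,
# whereas one fixed vector of row indices would not fit both tables.
d1 = data.frame(
  y = c(1.0, 2.0, 3.0, 4.0),
  test_or_train = c("train", "test", "train", "test")
)
d2 = data.frame(
  y = c(5.0, 6.0, 7.0),
  test_or_train = c("test", "train", "train")
)

ids1 = split_ids(d1)
ids2 = split_ids(d2)
```

Each table gets its own index sets from the same rule, which is the property the question asks for when tasks do not share rows.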

Answer:

You can use benchmark with a custom design.

The following code should do the job (note that I instantiate a custom Resampling separately for each Task).

library(data.table)
design = data.table(
  task = list(task1, task2),
  learner = list(learner)
)

library(mlr3misc)
design$resampling = map(design$task, function(x) {
  # get train/test split
  split = x$data()[["test_or_train"]]
  # remove train-test split column from the task
  x$select(setdiff(x$feature_names, "test_or_train"))
  # instantiate a custom resampling with the given split
  rsmp("custom")$instantiate(x,
    train_sets = list(which(split == "train")),
    test_sets = list(which(split == "test"))
  )
})

benchmark(design)
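As a hedged follow-up, the sketch below shows one way the benchmark result might be inspected once it is stored in a variable. It is self-contained and makes several assumptions that are not in the original: mtcars stands in for the real data, the split simply alternates rows, and a regr.rpart learner replaces the glmnet pipeline so that only mlr3 and data.table are needed.

```r
library(mlr3)
library(data.table)

# Toy data (assumption: mtcars stands in for the real data set).
dat = as.data.table(mtcars)
dat[, test_or_train := rep(c("train", "test"), length.out = .N)]

task = TaskRegr$new(id = "toy", backend = dat, target = "mpg")

# Same steps as in the answer, applied to a single task.
split = task$data()[["test_or_train"]]
task$select(setdiff(task$feature_names, "test_or_train"))
resampling = rsmp("custom")$instantiate(task,
  train_sets = list(which(split == "train")),
  test_sets = list(which(split == "test"))
)

design = data.table(
  task = list(task),
  learner = list(lrn("regr.rpart")),
  resampling = list(resampling)
)
bmr = benchmark(design)

# Aggregated test error for each task/learner combination.
scores = bmr$aggregate(msr("regr.rmse"))

# Predictions for the rows marked "test" in the first resample result.
pred = bmr$resample_result(1)$prediction()
```

With several tasks in the design, bmr$resample_result(i) would give access to the predictions of the i-th task/learner combination.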

Could you clarify what you mean by batch processing, or does this answer already solve your problem?
