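Variable reduction based on importance

I am having trouble filtering out the least important variables in my model. I received a dataset with more than 4000 variables and was asked to reduce the number of variables going into the model.

I have already tried two approaches, but both failed.

The first thing I tried was to manually check variable importance after modeling and then remove the unimportant variables based on that.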

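    # packages used throughout (mlr3verse attaches mlr3, mlr3learners, mlr3pipelines, mlr3tuning)
    library(mlr3verse)
    library(dplyr)

    # reproducible example
    data <- iris
    # artificial class imbalancing
    data <- iris %>%
      mutate(Species = as.factor(ifelse(Species == "virginica", "1", "0")))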

With a simple Learner, everything works fine:

    # creating Task
    task <- TaskClassif$new(id = "score", backend = data, target = "Species", positive = "1")
    # creating Learner
    lrn <- lrn("classif.xgboost")
    # setting scoring as prediction type
    lrn$predict_type = "prob"
    lrn$train(task)
    lrn$importance()
     Petal.Width Petal.Length
      0.90606304   0.09393696

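The problem is that the data is highly imbalanced, so I decided to use a GraphLearner with a PipeOp operator to undersample the majority group, which is then passed to an AutoTuner:

I skipped the parts of the code that I believe are not important for this case, such as the search space, terminator, tuner, etc.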

    # undersampling
    po_under <- po("classbalancing",
                   id = "undersample", adjust = "major",
                   reference = "major", shuffle = FALSE, ratio = 1 / 2)
    # combine learner with pipeline graph (%>>% is the mlr3pipelines concatenation operator)
    lrn_under <- GraphLearner$new(po_under %>>% lrn)
    # setting the autoTuner
    at <- AutoTuner$new(
      learner = lrn_under,
      resampling = resample,
      measure = measure,
      search_space = ps_under,
      terminator = terminator,
      tuner = tuner
    )
    at$train(task)

The problem now is that although the importance property is still visible in at, $importance() is not available.

    > at
    <AutoTuner:undersample.classif.xgboost.tuned>
    * Model: list
    * Parameters: list()
    * Packages: -
    * Predict Type: prob
    * Feature types: logical, integer, numeric, character, factor, ordered, POSIXct
    * Properties: featureless, importance, missings, multiclass, oob_error, selected_features, twoclass, weights

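So I decided to change my approach and tried adding filtering to the Learner. That is where I failed even more. I started by looking at this mlr3book page: https://mlr3book.mlr-org.com/fs.html. I tried adding importance = "impurity" to the Learner as shown there, but this resulted in an error.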

    > lrn <- lrn("classif.xgboost", importance = "impurity")
    Błąd w poleceniu 'instance[[nn]] <- dots[[i]]':
      nie można zmienić wartości zablokowanego połączenia dla 'importance'

Which basically means something like this error:

    Error in 'instance[[nn]] <- dots[[i]]':
      can't change value of blocked connection for 'importance'

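I also tried filtering via a PipeOp, but that failed miserably as well. I believe I won't be able to do it without importance = "impurity".

So my question is: is there a way to achieve what I am aiming for?

In addition, I would be very grateful if you could explain why filtering by importance is possible before modeling at all. Shouldn't it be based on the model results?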

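Answer:

The reason you can't access $importance of the at variable is that it is an AutoTuner, which does not directly offer variable importance and only "wraps" around the actual Learner being tuned.

The trained GraphLearner is saved inside your AutoTuner under $learner: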

    # get the trained GraphLearner, with tuned hyperparameters
    graphlearner <- at$learner

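This object also does not have $importance(). (In theory, a GraphLearner could contain more than one Learner, and then it would not even know which importance to give!)

Getting the actual LearnerClassifXgboost object is a bit tedious, unfortunately, because of shortcomings of the "R6" object system used by mlr3:

1. Get the untrained Learner object
2. Get the trained state of the Learner and put it into that object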

    # get the untrained Learner
    xgboostlearner <- graphlearner$graph$pipeops$classif.xgboost$learner
    # put the trained model into the Learner
    xgboostlearner$state <- graphlearner$model$classif.xgboost

Now the importance can be queried:

    xgboostlearner$importance()
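Putting the steps above together, a minimal end-to-end sketch on the iris example from the question might look like the following. It trains the GraphLearner directly instead of going through the AutoTuner (for a trained AutoTuner, at$learner would take the place of graphlearner). It assumes the packages mlr3, mlr3learners, mlr3pipelines and xgboost are installed. The values nrounds = 10 and n_keep = 1 are arbitrary illustrative choices, not something taken from the question.

```r
library(mlr3)
library(mlr3learners)
library(mlr3pipelines)

# binary task from the question: "virginica" vs. the rest
data <- iris
data$Species <- factor(ifelse(data$Species == "virginica", "1", "0"))
task <- TaskClassif$new(id = "score", backend = data, target = "Species", positive = "1")

# undersampling followed by xgboost, wrapped as a single Learner
po_under <- po("classbalancing", id = "undersample", adjust = "major",
               reference = "major", shuffle = FALSE, ratio = 1 / 2)
xgb <- lrn("classif.xgboost", predict_type = "prob", nrounds = 10)  # nrounds: assumption
graphlearner <- GraphLearner$new(po_under %>>% xgb)
graphlearner$train(task)

# step 1: untrained Learner from the graph; step 2: attach the trained state
xgboostlearner <- graphlearner$graph$pipeops$classif.xgboost$learner
xgboostlearner$state <- graphlearner$model$classif.xgboost
imp <- xgboostlearner$importance()  # named numeric vector, most important first

# keep only the n_keep most important features (n_keep is a hypothetical cutoff)
n_keep <- 1
task_small <- task$clone()$select(names(imp)[seq_len(n_keep)])
task_small$feature_names
```

The reduced task_small can then be used to retrain or retune the model on fewer variables. With 4000 variables, the cutoff would more sensibly be chosen by inspecting imp than fixed in advance.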

The example from the book that you linked does not work in your case because the book uses the ranger Learner, whereas you are using xgboost. importance = "impurity" is specific to ranger.
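The error message itself points the same way: in an English locale it reads "cannot change value of locked binding for 'importance'", which suggests that lrn() found no xgboost hyperparameter named importance and instead tried to overwrite the Learner's $importance() method. For filtering inside the pipeline, one possible sketch uses the importance filter from mlr3filters with xgboost as the scoring learner, with no importance argument. This assumes mlr3filters is installed in addition to the packages above. The values filter.nfeat = 2 and nrounds = 10 are illustrative assumptions.

```r
library(mlr3)
library(mlr3learners)
library(mlr3pipelines)
library(mlr3filters)

# binary task from the question: "virginica" vs. the rest
data <- iris
data$Species <- factor(ifelse(data$Species == "virginica", "1", "0"))
task <- TaskClassif$new(id = "score", backend = data, target = "Species", positive = "1")

# importance filter scored by xgboost; ranger would instead need
# lrn("classif.ranger", importance = "impurity") at this point
flt_imp <- flt("importance", learner = lrn("classif.xgboost", nrounds = 10))
po_filter <- po("filter", filter = flt_imp, filter.nfeat = 2)  # keep the top 2 features

# standalone use: train the PipeOp on the task and inspect what survives
filtered_task <- po_filter$train(list(task))[[1]]
filtered_task$feature_names

# inside a pipeline: filter first, then fit the final model on the kept features
xgb <- lrn("classif.xgboost", predict_type = "prob", nrounds = 10)
graphlearner <- GraphLearner$new(po_filter %>>% xgb)
graphlearner$train(task)
```

Note that the filter does not work "before modeling" in the strict sense: it fits its own xgboost model on the training data to obtain the scores, and only then passes the reduced feature set on to the final Learner.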
