xgboost Poisson regression: label must be nonnegative

I'm on a Windows 10 laptop, using R with xgboost version 0.6-4. Running the following code produces a strange error.

xgb_params <- list("objective" = "count:poisson",
                   "eval_metric" = "rmse")

regression <- xgboost(data = training_fold,
                      label = y_training_fold,
                      nrounds = 10,
                      params = xgb_params)

Error in xgb.iter.update(bst$handle, dtrain, iteration - 1, obj) :
  amalgamation/../src/objective/regression_obj.cc:190:
  Check failed: label_correct PoissonRegression: label must be nonnegative

But when I look at a summary of the labels, it shows:

  Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
0.1129  0.3387  0.7000  1.0987  1.5265  4.5405     287

How can I fix this? I tried removing the NA values, but that didn't help.

Thanks in advance!

EDIT

Here is a sample of the training data:

dput(droplevels(head(train[, c(1,2,4,5,6,8,9,10,11)], 20)))

structure(list(
  VacancyId = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 3L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 6L, 6L, 6L, 6L, 6L, 6L),
    .Label = c("55288", "56838", "57822", "57902", "57925", "58008"), class = "factor"),
  VacancyBankId = c(2L, 1609L, 1611L, 147L, 17L, 1611L, 2L, 257L, 1611L, 2L, 147L, 17L, 1611L, 239L, 1609L, 2L, 1609L, 2L, 2L, 1609L),
  FunctionId = c(36L, 36L, 36L, 36L, 35L, 35L, 3L, 4L, 4L, 4L, 4L, 9L, 9L, 9L, 3L, 3L, 3L, 3L, 3L, 3L),
  EducationLevel = c(6L, 6L, 6L, 6L, 6L, 6L, 4L, 6L, 6L, 6L, 6L, 4L, 4L, 4L, 6L, 6L, 6L, 6L, 6L, 6L),
  ProvinceId = c(22L, 22L, 22L, 22L, 24L, 24L, 19L, 16L, 16L, 16L, 16L, 19L, 19L, 19L, 21L, 21L, 16L, 16L, 22L, 22L),
  CandidatesCount = c(126L, 27L, 18L, 12L, 1L, 4L, 2L, 6L, 7L, 7L, 1L, 8L, 15L, 13L, 7L, 7L, 7L, 7L, 7L, 7L),
  DurationDays = c(62L, 62L, 62L, 62L, 18L, 18L, 43L, 61L, 61L, 61L, 61L, 60L, 60L, 60L, 62L, 62L, 62L, 62L, 62L, 62L),
  DurationWeeks = c(8.857142857, 8.857142857, 8.857142857, 8.857142857, 2.571428571, 2.571428571, 6.142857143, 8.714285714, 8.714285714, 8.714285714, 8.714285714, 8.571428571, 8.571428571, 8.571428571, 8.857142857, 8.857142857, 8.857142857, 8.857142857, 8.857142857, 8.857142857),
  CandidatesPerWeek = c(NA, 3.048387097, 2.032258065, 1.35483871, 0.388888889, 1.555555556, 0.325581395, 0.68852459, 0.803278689, 0.803278689, 0.114754098, 0.933333333, 1.75, 1.516666667, 0.790322581, 0.790322581, 0.790322581, 0.790322581, 0.790322581, 0.790322581)),
  .Names = c("VacancyId", "VacancyBankId", "FunctionId", "EducationLevel", "ProvinceId", "CandidatesCount", "DurationDays", "DurationWeeks", "CandidatesPerWeek"),
  row.names = c(1L, 2L, 3L, 4L, 5L, 6L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 26L, 27L, 28L, 29L, 30L, 31L),
  class = "data.frame")

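I want to predict the number of candidates per week from the function ID, education level, province, and vacancy bank ID. So y_training_fold is the number of candidates per week, and training_fold holds the function, education, province, and vacancy bank IDs.

I hope someone can help me out!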


Answer:

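The problem with your dataset is not that y_training_fold contains negative values, but that it contains non-integer values.
Consider the following simulation with a y_training_fold vector of non-integer values (drawn with rnorm, so some of them are also negative):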

library(xgboost)

training_fold <- matrix(rnorm(1000), nrow = 100)
y_training_fold <- matrix(rnorm(100), ncol = 1)

xgb_params <- list("objective" = "count:poisson",
                   "eval_metric" = "rmse")

regression <- xgboost(data = training_fold,
                      label = y_training_fold,
                      nrounds = 10,
                      params = xgb_params)

The error message is exactly the one you reported:

Error in xgb.iter.update(bst$handle, dtrain, iteration - 1, obj) :
  [11:46:28] amalgamation/../src/objective/regression_obj.cc:190:
  Check failed: label_correct PoissonRegression: label must be nonnegative

Now try a y_training_fold vector with integer values:

y_training_fold <- matrix(rpois(100, 10), ncol = 1)

xgb_params <- list("objective" = "count:poisson",
                   "eval_metric" = "rmse")

regression <- xgboost(data = training_fold,
                      label = y_training_fold,
                      nrounds = 10,
                      params = xgb_params)

Now xgboost runs fine:

[1]     train-rmse:9.795855
[2]     train-rmse:9.660112
[3]     train-rmse:9.492991
[4]     train-rmse:9.287366
[5]     train-rmse:9.034582
[6]     train-rmse:8.724205
[7]     train-rmse:8.343800
[8]     train-rmse:7.878869
[9]     train-rmse:7.312294
[10]    train-rmse:6.632671
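Before fitting, it can help to see which properties a label vector actually has, since missing, negative, and non-integer values can all be present at once. The following is a minimal base-R sketch; `check_labels` is a hypothetical helper written for illustration and is not part of xgboost.

```r
# Hypothetical helper (not an xgboost function): count the label properties
# that are worth inspecting before using a count objective.
check_labels <- function(y) {
  y <- as.numeric(y)          # also flattens a one-column matrix
  ok <- !is.na(y)             # is.na() is TRUE for both NA and NaN
  c(
    n_missing     = sum(!ok),
    n_negative    = sum(y[ok] < 0),
    n_non_integer = sum(y[ok] != round(y[ok]))
  )
}

# Small made-up vector: one NA, one negative value, two fractional values
res <- check_labels(c(NA, 3.048387097, 0.388888889, 7, -1))
print(res)
```

Calling `check_labels(y_training_fold)` on the real labels shows at a glance whether any rows still need to be dropped or transformed.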

EDIT

With your data, the fix is:

library(xgboost)

dts <- structure(list(
  VacancyId = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 3L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 6L, 6L, 6L, 6L, 6L, 6L),
    .Label = c("55288", "56838", "57822", "57902", "57925", "58008"), class = "factor"),
  VacancyBankId = c(2L, 1609L, 1611L, 147L, 17L, 1611L, 2L, 257L, 1611L, 2L, 147L, 17L, 1611L, 239L, 1609L, 2L, 1609L, 2L, 2L, 1609L),
  FunctionId = c(36L, 36L, 36L, 36L, 35L, 35L, 3L, 4L, 4L, 4L, 4L, 9L, 9L, 9L, 3L, 3L, 3L, 3L, 3L, 3L),
  EducationLevel = c(6L, 6L, 6L, 6L, 6L, 6L, 4L, 6L, 6L, 6L, 6L, 4L, 4L, 4L, 6L, 6L, 6L, 6L, 6L, 6L),
  ProvinceId = c(22L, 22L, 22L, 22L, 24L, 24L, 19L, 16L, 16L, 16L, 16L, 19L, 19L, 19L, 21L, 21L, 16L, 16L, 22L, 22L),
  CandidatesCount = c(126L, 27L, 18L, 12L, 1L, 4L, 2L, 6L, 7L, 7L, 1L, 8L, 15L, 13L, 7L, 7L, 7L, 7L, 7L, 7L),
  DurationDays = c(62L, 62L, 62L, 62L, 18L, 18L, 43L, 61L, 61L, 61L, 61L, 60L, 60L, 60L, 62L, 62L, 62L, 62L, 62L, 62L),
  DurationWeeks = c(8.857142857, 8.857142857, 8.857142857, 8.857142857, 2.571428571, 2.571428571, 6.142857143, 8.714285714, 8.714285714, 8.714285714, 8.714285714, 8.571428571, 8.571428571, 8.571428571, 8.857142857, 8.857142857, 8.857142857, 8.857142857, 8.857142857, 8.857142857),
  CandidatesPerWeek = c(NA, 3.048387097, 2.032258065, 1.35483871, 0.388888889, 1.555555556, 0.325581395, 0.68852459, 0.803278689, 0.803278689, 0.114754098, 0.933333333, 1.75, 1.516666667, 0.790322581, 0.790322581, 0.790322581, 0.790322581, 0.790322581, 0.790322581)),
  .Names = c("VacancyId", "VacancyBankId", "FunctionId", "EducationLevel", "ProvinceId", "CandidatesCount", "DurationDays", "DurationWeeks", "CandidatesPerWeek"),
  row.names = c(1L, 2L, 3L, 4L, 5L, 6L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 26L, 27L, 28L, 29L, 30L, 31L),
  class = "data.frame")

# Drop rows with missing values
dts <- na.omit(dts)

# Build the X matrix of potential predictors
# Important: do not use the first column (ID) or the last column (response)
training_fold <- as.matrix(dts[, -c(1, 9)])

# Round the response to the nearest integer
y_training_fold <- as.matrix(dts[, 9])
y_training_fold <- round(y_training_fold)

xgb_params <- list("objective" = "count:poisson",
                   "eval_metric" = "rmse")

( regression <- xgboost(data = training_fold,
                        label = y_training_fold,
                        nrounds = 10,
                        params = xgb_params) )

# Output
##### xgb.Booster
# raw: 4.6 Kb
# call:
#   xgb.train(params = params, data = dtrain, nrounds = nrounds,
#     watchlist = watchlist, verbose = verbose, print_every_n = print_every_n,
#     early_stopping_rounds = early_stopping_rounds, maximize = maximize,
#     save_period = save_period, save_name = save_name, xgb_model = xgb_model,
#     callbacks = callbacks)
# params (as set within xgb.train):
#   objective = "count:poisson", eval_metric = "rmse", silent = "1"
# xgb.attributes:
#   niter
# callbacks:
#   cb.print.evaluation(period = print_every_n)
#   cb.evaluation.log()
#   cb.save.model(save_period = save_period, save_name = save_name)
# niter: 10
# evaluation_log:
#     iter train_rmse
#        1   0.914084
#        2   0.829741
# ---
#        9   0.332951
#       10   0.291877
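Rounding a per-week rate to the nearest integer discards information, so it is worth looking at what the step does to the labels before relying on it. The sketch below uses only base R; the `rate` vector is copied from the 19 non-missing CandidatesPerWeek values in the sample data, and the variable names are illustrative only.

```r
# CandidatesPerWeek from the sample data, with the NA row already removed
rate <- c(3.048387097, 2.032258065, 1.35483871, 0.388888889, 1.555555556,
          0.325581395, 0.68852459, 0.803278689, 0.803278689, 0.114754098,
          0.933333333, 1.75, 1.516666667, rep(0.790322581, 6))

# Same transformation as in the answer's code
rounded <- round(rate)

# How many observations end up at each integer label
counts <- table(rounded)
print(counts)

# Largest absolute change that rounding introduces for a single label
max_shift <- max(abs(rate - rounded))
print(max_shift)
```

In this sample, three of the 19 labels become 0 and eleven become 1, so it may be worth checking whether that granularity is acceptable for the prediction task.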
