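I am trying to detect web spam using the C4.5 decision tree algorithm with 10-fold cross-validation. After feature selection, my dataset contains 8944 observations and 36 variables.
Here is my code: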
    # Divide the dataset into train and test
    trainRowNumbers <- createDataPartition(final1$spam, p = 0.7, list = FALSE)
    # Create the training dataset
    trainData <- final1[trainRowNumbers, ]
    # Create the test data
    testData <- final1[-trainRowNumbers, ]
    # C4.5 using 10-fold cross-validation
    set.seed(1958)
    train_control <- createFolds(trainData$spam, k = 10)
    C45Fit <- train(spam ~ ., method = "J48", data = trainData,
                    tuneLength = 15,
                    trControl = trainControl(method = "cv", indexOut = train_control))
Here is the error I get:
    C45Fit <- train(spam ~ ., method = "J48", data = trainData, tuneLength = 15, trControl = trainControl(method = "cv", indexOut = train_control))
    Error in train(spam ~ ., method = "J48", data = trainData, tuneLength = 15, :
      unused arguments (method = "J48", data = trainData, tuneLength = 15, trControl = trainControl(method = "cv", indexOut = train_control))
I have a few questions:
- How do I resolve this error?
- How should I set the tuneLength parameter?
The head of my dataset:
    > head(trainData)
      hostid                          host      HST_4     HST_6     HST_7     HST_8     HST_9    HST_10     HST_16
    1      0        007cleaningagent.co.uk 0.03370787 1.9791304 0.1123596 0.1516854 0.2247191 0.2977528 0.07865169
    2      1          0800.loan-line.co.uk 1.39539347 2.4222020 0.2284069 0.2610365 0.3531670 0.4529750 0.02879079
    4      3 102belfast.boys-brigade.org.uk 0.29729730 1.1800000 0.2162162 0.3783784 0.5135135 0.5405405 0.21621622
    5      4  10bristol.boys-brigade.org.uk 0.28804348 1.7745267 0.1141304 0.1847826 0.2608696 0.3750000 0.08152174
    6      5  10enfield.boys-brigade.org.uk 0.00000000 0.8468468 0.0625000 0.1875000 0.1875000 0.3125000 0.06250000
    8      8             13thcoventry.co.uk 0.05797101 2.1113074 0.2318841 0.3091787 0.3961353 0.5507246 0.09178744
          HST_17    HST_18 HST_20    HMG_29     HMG_40     HMG_41    HMG_42    AVG_50    AVG_51     AVG_55    AVG_57
    1 0.15730337 0.2247191  0.070 0.2907760 0.02702703 0.07207207 0.1351351  32431.65  7.215054 0.02289305 0.2980171
    2 0.05566219 0.1094050  0.075 0.0495162 0.10641628 0.17840376 0.2410016 150592.89  2.000000 0.49661240 0.1137439
    4 0.37837838 0.4054054  0.040 0.2156130 0.03971119 0.11552347 0.1480144  16129.61  2.125000 0.12297815 0.2033877
    5 0.13043478 0.2119565  0.075 0.0405612 0.08152174 0.13043478 0.2119565  28759.75  2.870968 0.19622331 0.0673372
    6 0.18750000 0.2500000  0.005 0.1125400 0.02528090 0.12359551 0.1432584  70966.61  2.000000 0.03948338 0.2513755
    8 0.14975845 0.2512077  0.095 0.1946150 0.04382470 0.10458167 0.1633466 109388.89 11.484940 0.03547817 0.1387366
           AVG_58   AVG_59     AVG_61     AVG_63    AVG_65    AVG_67     STD_77     STD_79       STD_80     STD_81
    1 0.030079101 1.888686 0.04982536 0.07119317 0.1539772 0.2237475 0.02240051 0.04634758 0.0003248904 0.07644575
    2 0.005874481 2.423238 0.14016213 0.17484142 0.2460647 0.3279534 0.03014901 0.05352347 0.0006170884 0.09449420
    4 0.017285860 1.657795 0.08748573 0.14192639 0.2273218 0.2815660 0.03715705 0.07385004 0.0021174754 0.15725521
    5 0.007008439 1.656472 0.10088409 0.17370255 0.2791502 0.3839271 0.03382564 0.07695898 0.0011314215 0.14290420
    6 0.017145414 2.284363 0.09245673 0.14045514 0.2267635 0.2907555 0.02459505 0.06418522 0.0007756064 0.16533374
    8 0.001818059 2.300361 0.17326186 0.25910768 0.3351511 0.4479340 0.05611160 0.07531329 0.0005475770 0.15796253
         STD_83      STD_84     STD_85     STD_87    STD_94   spam
    1 0.1219990 0.001009964 0.04043011 0.04198925 0.3400028 normal
    2 0.1539489 0.001734261 0.15000000 0.16000000 0.3147682 normal
    4 0.2027374 0.006655953 0.06437500 0.06031250 0.7100778 normal
    5 0.1925378 0.002708827 0.04258065 0.05290323 0.8195509 normal
    6 0.2223814 0.005491305 0.09125000 0.08062500 1.2953592 normal
    8 0.2366591 0.002588343 0.21698795 0.14774096 0.2882247 normal
Output of sessionInfo():
    > sessionInfo()
    R version 3.4.0 (2017-04-21)
    Platform: x86_64-w64-mingw32/x64 (64-bit)
    Running under: Windows >= 8 x64 (build 9200)

    Matrix products: default

    locale:
    [1] LC_COLLATE=English_Australia.1252  LC_CTYPE=English_Australia.1252  LC_MONETARY=English_Australia.1252
    [4] LC_NUMERIC=C                       LC_TIME=English_Australia.1252

    attached base packages:
    [1] grid  stats  graphics  grDevices  utils  datasets  methods  base

    other attached packages:
     [1] bindrcpp_0.2  ggthemes_3.5.0  randomForest_4.6-12  Metrics_0.1.3  RWeka_0.4-37  mlr_2.12.1
     [7] ParamHelpers_1.10  rgeos_0.3-26  VIM_4.7.0  data.table_1.10.4-3  colorspace_1.3-2  mice_2.46.0
    [13] RANN_2.5.1  kernlab_0.9-25  mlbench_2.1-1  caret_6.0-79  ggplot2_2.2.1  lattice_0.20-35
    [19] dplyr_0.7.4

    loaded via a namespace (and not attached):
     [1] nlme_3.1-131  lubridate_1.7.3  bit64_0.9-7  dimRed_0.1.0  httr_1.3.1  backports_1.1.2  tools_3.4.0
     [8] R6_2.2.2  rpart_4.1-11  DBI_0.8  lazyeval_0.2.1  nnet_7.3-12  withr_2.1.0  sp_1.2-7
    [15] tidyselect_0.2.3  mnormt_1.5-5  parallelMap_1.3  bit_1.1-12  curl_3.0  compiler_3.4.0  checkmate_1.8.5
    [22] scales_0.5.0  sfsmisc_1.1-1  DEoptimR_1.0-8  lmtest_0.9-35  psych_1.7.8  robustbase_0.92-8  stringr_1.2.0
    [29] foreign_0.8-67  rio_0.5.10  pkgconfig_2.0.1  RWekajars_3.9.2-1  rlang_0.2.0  readxl_1.0.0  ddalpha_1.3.1
    [36] BBmisc_1.11  bindr_0.1  zoo_1.8-0  ModelMetrics_1.1.0  car_3.0-0  magrittr_1.5  Matrix_1.2-12
    [43] Rcpp_0.12.14  munsell_0.4.3  abind_1.4-5  stringi_1.1.6  carData_3.0-1  MASS_7.3-47  plyr_1.8.4
    [50] recipes_0.1.1  parallel_3.4.0  forcats_0.3.0  haven_1.1.1  splines_3.4.0  pillar_1.2.1  boot_1.3-19
    [57] rjson_0.2.15  reshape2_1.4.2  codetools_0.2-15  stats4_3.4.0  CVST_0.2-1  glue_1.2.0  laeken_0.4.6
    [64] vcd_1.4-4  foreach_1.4.3  twitteR_1.1.9  cellranger_1.1.0  gtable_0.2.0  purrr_0.2.4  tidyr_0.7.2
    [71] assertthat_0.2.0  DRR_0.0.2  gower_0.1.2  openxlsx_4.0.17  prodlim_1.6.1  broom_0.4.3  e1071_1.6-8
    [78] class_7.3-14  survival_2.41-3  timeDate_3042.101  RcppRoll_0.2.2  tibble_1.4.2  rJava_0.9-9  iterators_1.0.8
    [85] lava_1.5.1  ipred_0.9-6
Thanks in advance for any advice.
Answer:
I can reproduce the error message as follows:
    library(RWeka)
    library(caret)
    library(mlr)
    # Loading required package: ParamHelpers
    # Attaching package: ‘mlr’
    # The following object is masked from ‘package:caret’:
    #     train

    # Divide the dataset into train and test
    trainRowNumbers <- createDataPartition(iris$Species, p = 0.7, list = FALSE)
    # Create the training dataset
    trainData <- iris[trainRowNumbers, ]
    # Create the test data
    testData <- iris[-trainRowNumbers, ]
    # C4.5 using 10-fold cross-validation
    set.seed(1958)
    train_control <- createFolds(trainData$Species, k = 10)
    C45Fit <- train(Species ~ ., method = "J48", data = trainData,
                    tuneLength = 15,
                    trControl = trainControl(method = "cv", indexOut = train_control))
    # Error in train(Species ~ ., method = "J48", data = trainData, tuneLength = 15, :
    #   unused arguments (method = "J48", data = trainData, tuneLength = 15, trControl = trainControl(method = "cv", indexOut = train_control))
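Note the message "The following object is masked from ‘package:caret’: train". If, after loading caret, you load another package that also contains a train function (mlr in this example), R will by default use the train function from the most recently loaded package. (That is why I asked for sessionInfo(): to see which packages were loaded. For the same reason, a reproducible example should include the packages you load.) Instead of running the train from caret, R runs the train from mlr (or whichever other package you loaded), which does not accept these arguments and therefore returns the error message.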
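To check which definition a bare function call will resolve to, base R's find() can list every attached environment that defines a given name. The sketch below is only an illustration: where_defined is a hypothetical helper name, and the masking is simulated with a global sd() so that it runs without caret or mlr installed.

```r
# Hypothetical helper (not part of caret or mlr): list every attached
# environment that defines a function called `fn_name`, in search-path order.
# The first entry is the one a bare call resolves to.
where_defined <- function(fn_name) {
  find(fn_name, mode = "function")
}

# Simulate masking with base R only: this sd() now sits in front of stats::sd().
sd <- function(x) "masked version"

where_defined("sd")      # lists every place sd() is defined; the first entry wins
sd(c(1, 2, 3))           # resolves to the masking definition above
stats::sd(c(1, 2, 3))    # the :: operator bypasses the search path

rm(sd)                   # remove the mask again
```

In a session like the one in the question, where_defined("train") would be expected to list both package:mlr and package:caret, with mlr first because it was attached later.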
The solution is either to load caret last, or to call the train function from caret explicitly with caret::train(...).
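A minimal sketch of the explicit-namespace fix on the iris reproduction, assuming caret and RWeka (which needs a working Java installation) are available. As a simplification that is not in the original code, it lets trainControl build the 10 folds itself via number = 10 instead of passing indexOut, and it uses a smaller tuneLength to keep the run short.

```r
# Sketch only: runs the model if caret and RWeka are installed, otherwise does nothing.
if (requireNamespace("caret", quietly = TRUE) &&
    requireNamespace("RWeka", quietly = TRUE)) {
  library(RWeka)
  library(caret)

  set.seed(1958)
  rows <- caret::createDataPartition(iris$Species, p = 0.7, list = FALSE)
  trainData <- iris[rows, ]

  # Assumption: let caret create the 10 cross-validation folds itself.
  ctrl <- caret::trainControl(method = "cv", number = 10)

  # caret::train is called explicitly, so it no longer matters whether
  # another attached package (such as mlr) also exports a train().
  C45Fit <- caret::train(Species ~ ., data = trainData, method = "J48",
                         tuneLength = 3, trControl = ctrl)
  print(C45Fit)
}
```

The same prefix works for the original call on final1: replacing train(...) with caret::train(...) is enough to bypass the mlr version.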