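I am currently practicing WEKA modeling with the free UCI breast cancer .arff file. With the help of various posts here, I have been able to tune its accuracy to somewhere between 63% and 73%. I am using WEKA 3.7.10 on a Windows 7 Starter machine.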
I used attribute selection to reduce the number of variables, with InfoGainAttributeEval as the evaluator and Ranker as the search method. I selected the top five ranked attributes. The results were:

    Evaluator:    weka.attributeSelection.InfoGainAttributeEval
    Search:       weka.attributeSelection.Ranker -T -1.7976931348623157E308 -N -1
    Relation:     breast-cancer
    Instances:    286
    Attributes:   10
                  age
                  menopause
                  tumor-size
                  inv-nodes
                  node-caps
                  deg-malig
                  breast
                  breast-quad
                  irradiat
                  Class
    Evaluation mode:    10-fold cross-validation

    === Attribute selection 10 fold cross-validation (stratified), seed: 1 ===

    average merit      average rank    attribute
    0.078 +- 0.011     1.3 +- 0.64     6 deg-malig
    0.071 +- 0.01      1.9 +- 0.3      4 inv-nodes
    0.061 +- 0.008     3   +- 0.77     3 tumor-size
    0.051 +- 0.007     3.8 +- 0.4      5 node-caps
    0.026 +- 0.006     5   +- 0        9 irradiat
    0.012 +- 0.003     6.4 +- 0.49     1 age
    0.01  +- 0.003     6.6 +- 0.49     8 breast-quad
    0.003 +- 0.001     8.5 +- 0.5      7 breast
    0.003 +- 0.002     8.5 +- 0.5      2 menopause

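For readers unfamiliar with the "merit" column, here is a minimal sketch of the quantity an information-gain evaluator ranks by: the class entropy minus the class entropy conditioned on one nominal attribute. The function names and the toy data are hypothetical, and this is not WEKA's own implementation (WEKA additionally averages the merit over the cross-validation folds).

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def info_gain(values, labels):
    """H(class) - H(class | attribute) for one nominal attribute."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted average of the class entropy inside each attribute value.
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Hypothetical toy data, not taken from the breast-cancer file.
toy_labels = ["no", "no", "yes", "yes"]
perfect_attribute = ["x", "x", "y", "y"]   # separates the classes completely
useless_attribute = ["x", "y", "x", "y"]   # carries no class information

gain_perfect = info_gain(perfect_attribute, toy_labels)
gain_useless = info_gain(useless_attribute, toy_labels)
```

An attribute that separates the classes completely gets a gain equal to the class entropy, while one that is independent of the class gets a gain of zero. The merits in the table above are all well below the class entropy, so no single attribute is strongly informative on its own.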
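After removing the lower-ranked variables, I went on to build my model. I chose the multilayer perceptron because it is the algorithm required by the journal paper I am working from.
One piece of advice from Bernhard Pfahringer is to set the learning rate and momentum to 0.1, and to use 1, 2, 4, 8, etc. as exponential factors for the hidden nodes and epochs.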
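After a few attempts, I noticed that using 2 hidden nodes together with a power of two for the number of epochs (512, 1024, 2048, ...) improved accuracy. For example, 2 hidden nodes with 1024 epochs, and so on.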
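The search described above can be written down as a small grid of option strings. The sketch below is only an illustration: the helper name `mlp_option_grid` is hypothetical, and the flags `-L`, `-M`, `-N` and `-H` are assumed to mean learning rate, momentum, epochs and hidden nodes, as the scheme line in this post suggests.

```python
def mlp_option_grid(hidden_exponents, epoch_exponents,
                    learning_rate=0.1, momentum=0.1):
    """Build one option string per (hidden nodes, epochs) pair.

    Both settings are powers of two, so the inputs are the exponents.
    """
    grid = []
    for h in hidden_exponents:
        for e in epoch_exponents:
            grid.append(
                f"-L {learning_rate} -M {momentum} -N {2 ** e} -H {2 ** h}"
            )
    return grid


# Two hidden nodes (2**1) with 512 up to 16384 epochs (2**9 .. 2**14).
options = mlp_option_grid(hidden_exponents=[1], epoch_exponents=range(9, 15))
```

Each string can then be tried in turn and the cross-validated accuracy recorded, which makes it easier to see whether the improvement is systematic or just noise between runs.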
I got a range of different results, but the highest accuracy so far came from 2 hidden nodes and 16384 epochs:

    Scheme:       weka.classifiers.functions.MultilayerPerceptron -L 0.1 -M 0.1 -N 16384 -V 0 -S 0 -E 20 -H 2
    Relation:     breast-cancer-weka.filters.unsupervised.attribute.Remove-R1-2,7-8
    Instances:    286
    Attributes:   6
                  tumor-size
                  inv-nodes
                  node-caps
                  deg-malig
                  irradiat
                  Class
    Test mode:    10-fold cross-validation

    === Classifier model (full training set) ===

    Sigmoid Node 0
        Inputs    Weights
        Threshold    -2.4467109489840375
        Node 2    2.960926490700117
        Node 3    1.5276384018358489
    Sigmoid Node 1
        Inputs    Weights
        Threshold    2.446710948984037
        Node 2    -2.9609264907001167
        Node 3    -1.5276384018358493
    Sigmoid Node 2
        Inputs    Weights
        Threshold    0.8594931368555995
        Attrib tumor-size=0-4    -0.6809394102558067
        Attrib tumor-size=5-9    -0.7999278705976403
        Attrib tumor-size=10-14    -0.5139914771540879
        Attrib tumor-size=15-19    2.3071396030112834
        Attrib tumor-size=20-24    -6.316868254289899
        Attrib tumor-size=25-29    5.535754474315768
        Attrib tumor-size=30-34    -12.31495416708197
        Attrib tumor-size=35-39    2.165860489861981
        Attrib tumor-size=40-44    10.740913335424047
        Attrib tumor-size=45-49    9.102261927484186
        Attrib tumor-size=50-54    -17.072392893550735
        Attrib tumor-size=55-59    0.043056333044031
        Attrib inv-nodes=0-2    9.578867366884618
        Attrib inv-nodes=3-5    1.3248317047328586
        Attrib inv-nodes=6-8    -5.081199984305494
        Attrib inv-nodes=9-11    -8.604844224457239
        Attrib inv-nodes=12-14    2.2330604430275907
        Attrib inv-nodes=15-17    -2.8692154868988355
        Attrib inv-nodes=18-20    0.04225234708199947
        Attrib inv-nodes=21-23    0.017664071511846485
        Attrib inv-nodes=24-26    -0.9992481277256989
        Attrib inv-nodes=27-29    -0.02737484354173595
        Attrib inv-nodes=30-32    -0.04607516719307534
        Attrib inv-nodes=33-35    -0.038969156415242706
        Attrib inv-nodes=36-39    0.03338452826774849
        Attrib node-caps    6.764954936579671
        Attrib deg-malig=1    -5.037151186065571
        Attrib deg-malig=2    12.469858109768378
        Attrib deg-malig=3    -8.382625277311769
        Attrib irradiat    8.302010702287868
    Sigmoid Node 3
        Inputs    Weights
        Threshold    -0.7428771456532647
        Attrib tumor-size=0-4    3.5709673152321555
        Attrib tumor-size=5-9    3.563713261511895
        Attrib tumor-size=10-14    7.86118954430952
        Attrib tumor-size=15-19    2.8762105204084167
        Attrib tumor-size=20-24    4.60168522637948
        Attrib tumor-size=25-29    -5.849391383398816
        Attrib tumor-size=30-34    -1.6805815971562046
        Attrib tumor-size=35-39    -12.022394228003419
        Attrib tumor-size=40-44    11.922229608392747
        Attrib tumor-size=45-49    -1.9939414047194557
        Attrib tumor-size=50-54    -5.9801974214306215
        Attrib tumor-size=55-59    -0.04909236196295539
        Attrib inv-nodes=0-2    5.569516359775502
        Attrib inv-nodes=3-5    -7.871275549119543
        Attrib inv-nodes=6-8    3.405277467966008
        Attrib inv-nodes=9-11    -0.3253699778307026
        Attrib inv-nodes=12-14    1.244234346055825
        Attrib inv-nodes=15-17    1.179311225120216
        Attrib inv-nodes=18-20    0.03495291263409073
        Attrib inv-nodes=21-23    0.0043299366591334695
        Attrib inv-nodes=24-26    0.6595250300030937
        Attrib inv-nodes=27-29    -0.02503529326219822
        Attrib inv-nodes=30-32    0.041787638417097844
        Attrib inv-nodes=33-35    0.008416652090130837
        Attrib inv-nodes=36-39    -0.014551878794926747
        Attrib node-caps    4.7997880904143955
        Attrib deg-malig=1    1.6752746955482163
        Attrib deg-malig=2    6.130488722916935
        Attrib deg-malig=3    -6.989852429736567
        Attrib irradiat    8.716254786514295
    Class no-recurrence-events
        Input
        Node 0
    Class recurrence-events
        Input
        Node 1

    Time taken to build model: 27.05 seconds

    === Stratified cross-validation ===
    === Summary ===

    Correctly Classified Instances         210               73.4266 %
    Incorrectly Classified Instances        76               26.5734 %
    Kappa statistic                          0.2864
    Mean absolute error                      0.3312
    Root mean squared error                  0.4494
    Relative absolute error                 79.1456 %
    Root relative squared error             98.3197 %
    Coverage of cases (0.95 level)          98.951  %
    Mean rel. region size (0.95 level)      97.7273 %
    Total Number of Instances              286

    === Detailed Accuracy By Class ===

                   TP Rate  FP Rate  Precision  Recall  F-Measure  MCC    ROC Area  PRC Area  Class
                   0.891    0.635    0.768      0.891   0.825      0.300  0.633     0.748     no-recurrence-events
                   0.365    0.109    0.585      0.365   0.449      0.300  0.633     0.510     recurrence-events
    Weighted Avg.  0.734    0.479    0.714      0.734   0.713      0.300  0.633     0.677

    === Confusion Matrix ===

       a   b   <-- classified as
     179  22 |   a = no-recurrence-events
      54  31 |   b = recurrence-events

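As a sanity check on the summary above, the headline numbers can be recomputed from the confusion matrix alone. The sketch below uses a hypothetical helper name; it reproduces the accuracy and kappa and also computes the accuracy of always predicting the majority class, which is a useful reference point on an imbalanced dataset like this one.

```python
def summarize(confusion):
    """Return (accuracy, Cohen's kappa, majority-class baseline accuracy).

    Rows of the matrix are actual classes, columns are predicted classes.
    """
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    accuracy = correct / total
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(col) for col in zip(*confusion)]
    # Agreement expected by chance, given the marginal totals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    kappa = (accuracy - expected) / (1 - expected)
    baseline = max(row_totals) / total
    return accuracy, kappa, baseline


# Confusion matrix from the WEKA output in this post.
accuracy, kappa, baseline = summarize([[179, 22], [54, 31]])
```

With 201 of the 286 instances in the no-recurrence class, always predicting that class already gives about 70.3% accuracy, so the 73.4% reported here is only a small gain over that baseline. The low kappa and the 0.365 recall on recurrence-events point the same way.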
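My question is: how can I raise the accuracy on this data to at least 90%? Do I need to apply a filter, or use a different pattern of MLP input parameters?
Once I have learned how to do this, I plan to use another dataset (it has about 50 variables and 100,000 instances).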
Answer:
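Obviously there is no perfect answer to a question like this, but here are some more or less general tips on using an MLP:
- First, why are you removing features from such a small dataset? Feature selection matters for high-dimensional problems and/or computationally expensive models. Neither applies to the breast cancer data with an MLP.
- The number of iterations is the worst stopping criterion for an MLP. You should stop training when the validation error starts to rise, not after a fixed number of iterations.
- I don't know which cost function you are using, but the most important thing is regularization, because MLPs overfit easily. At a bare minimum you need some Tikhonov regularization.
- For a problem like this, more than one hidden layer is complete overkill. In particular, because of the vanishing gradient phenomenon, training multiple hidden layers in an MLP is usually not possible.
- To get rid of the learning algorithm's parameterization, I would also suggest dropping the vanilla algorithm and using at least resilient propagation (Rprop), which has proven to work well in many applications.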
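The three training-related tips above (stopping on validation error, Tikhonov regularization, resilient propagation) can be sketched in a few lines each. This is a library-independent illustration under stated assumptions: all function names, the patience rule, and the default constants are hypothetical choices, and none of this is WEKA's API. The Rprop variant shown skips the update after a gradient sign change, which is only one of several common variants.

```python
def train_with_early_stopping(train_one_epoch, validation_loss,
                              max_epochs=1000, patience=10):
    """Stop once the validation loss has not improved for `patience` epochs."""
    best_loss, best_epoch, best_state = float("inf"), 0, None
    for epoch in range(1, max_epochs + 1):
        state = train_one_epoch()          # returns the current model state
        loss = validation_loss(state)
        if loss < best_loss:
            best_loss, best_epoch, best_state = loss, epoch, state
        elif epoch - best_epoch >= patience:
            break
    return best_state, best_epoch, best_loss


def l2_penalized_loss(data_loss, weights, lam):
    """Tikhonov (L2) regularization: add lam * sum of squared weights."""
    return data_loss + lam * sum(w * w for w in weights)


def rprop_step(weights, grads, prev_grads, steps,
               eta_plus=1.2, eta_minus=0.5, step_min=1e-6, step_max=50.0):
    """One in-place Rprop update; only the sign of each gradient is used."""
    for i, g in enumerate(grads):
        product = g * prev_grads[i]
        if product > 0:                    # same direction: grow the step
            steps[i] = min(steps[i] * eta_plus, step_max)
        elif product < 0:                  # overshoot: shrink, skip this update
            steps[i] = max(steps[i] * eta_minus, step_min)
            g = 0.0
        if g > 0:
            weights[i] -= steps[i]
        elif g < 0:
            weights[i] += steps[i]
        prev_grads[i] = g


# Usage on a toy quadratic with its minimum at (3, -1).
w, prev, steps = [0.0, 0.0], [0.0, 0.0], [0.1, 0.1]
for _ in range(300):
    rprop_step(w, [2 * (w[0] - 3.0), 2 * (w[1] + 1.0)], prev, steps)
```

Because Rprop adapts a separate step size per weight from gradient signs, there is no learning rate or momentum to tune, which is the point of the last tip. The `-V 0` in the scheme line of the question appears to mean that no validation set was held out during training; that reading of the flag is an assumption to check against the MultilayerPerceptron documentation.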