Using Support Vector Machines in Apache Spark

When I try to run the support vector machine example in Apache Spark from the terminal with the command ./run-example org.apache.spark.mllib.classification.SVM local <path-to-dir>/sample_svm_data.txt 2 2.0 2, I get the following error message.

Exception in thread "main" java.lang.NumberFormatException: For input string: "1 0 2.52078447201548 0 0 0 2.004684436494304 2.000347299268466 0 2.228387042742021 2.228387042742023 0 0 0 0 0 0"
    at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1241)
    at java.lang.Double.parseDouble(Double.java:540)
    at scala.collection.immutable.StringLike$class.toDouble(StringLike.scala:234)
    at scala.collection.immutable.StringOps.toDouble(StringOps.scala:31)
    at org.apache.spark.mllib.util.MLUtils$$anonfun$loadLabeledData$1.apply(MLUtils.scala:45)
    at org.apache.spark.mllib.util.MLUtils$$anonfun$loadLabeledData$1.apply(MLUtils.scala:43)
    at scala.collection.Iterator$$anon$19.next(Iterator.scala:401)
    at scala.collection.Iterator$$anon$18.next(Iterator.scala:385)
    at scala.collection.Iterator$class.foreach(Iterator.scala:772)
    at scala.collection.Iterator$$anon$18.foreach(Iterator.scala:379)
    at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:102)
    at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:250)
    at scala.collection.Iterator$$anon$18.toBuffer(Iterator.scala:379)
    at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:237)
    at scala.collection.Iterator$$anon$18.toArray(Iterator.scala:379)
    at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:768)
    at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:768)
    at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:758)
    at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:758)
    at org.apache.spark.scheduler.DAGScheduler.runLocallyWithinThread(DAGScheduler.scala:484)
    at org.apache.spark.scheduler.DAGScheduler$$anon$2.run(DAGScheduler.scala:470)

I have added the full dump below for further diagnosis.

13/12/13 12:26:54 INFO slf4j.Slf4jEventHandler: Slf4jEventHandler started
13/12/13 12:26:54 INFO spark.SparkEnv: Registering BlockManagerMaster
13/12/13 12:26:54 INFO storage.MemoryStore: MemoryStore started with capacity 9.2 GB.
13/12/13 12:26:54 INFO storage.DiskStore: Created local directory at /tmp/spark-local-20131213122654-abb2
13/12/13 12:26:54 INFO network.ConnectionManager: Bound socket to port 36563 with id = ConnectionManagerId(<master>,36563)
13/12/13 12:26:54 INFO storage.BlockManagerMaster: Trying to register BlockManager
13/12/13 12:26:54 INFO storage.BlockManagerMasterActor$BlockManagerInfo: Registering block manager <master>:36563 with 9.2 GB RAM
13/12/13 12:26:54 INFO storage.BlockManagerMaster: Registered BlockManager
13/12/13 12:26:54 INFO server.Server: jetty-7.x.y-SNAPSHOT
13/12/13 12:26:54 INFO server.AbstractConnector: Started [email protected]:56637
13/12/13 12:26:54 INFO broadcast.HttpBroadcast: Broadcast server started at http://10.232.5.169:56637
13/12/13 12:26:54 INFO spark.SparkEnv: Registering MapOutputTracker
13/12/13 12:26:54 INFO spark.HttpFileServer: HTTP File server directory is /tmp/spark-986ccc2b-5a40-48ae-8801-566b0f32895b
13/12/13 12:26:54 INFO server.Server: jetty-7.x.y-SNAPSHOT
13/12/13 12:26:54 INFO server.AbstractConnector: Started [email protected]:59613
13/12/13 12:26:54 INFO server.Server: jetty-7.x.y-SNAPSHOT
13/12/13 12:26:54 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/storage/rdd,null}
13/12/13 12:26:54 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/storage,null}
13/12/13 12:26:54 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/stages/stage,null}
13/12/13 12:26:54 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/stages/pool,null}
13/12/13 12:26:54 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/stages,null}
13/12/13 12:26:54 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/environment,null}
13/12/13 12:26:54 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/executors,null}
13/12/13 12:26:54 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/metrics/json,null}
13/12/13 12:26:54 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/static,null}
13/12/13 12:26:54 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/,null}
13/12/13 12:26:54 INFO server.AbstractConnector: Started [email protected]:4040
13/12/13 12:26:54 INFO ui.SparkUI: Started Spark Web UI at http://<master>:4040
13/12/13 12:26:54 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
13/12/13 12:26:55 INFO storage.MemoryStore: ensureFreeSpace(121635) called with curMem=0, maxMem=9907879280
13/12/13 12:26:55 INFO storage.MemoryStore: Block broadcast_0 stored as values to memory (estimated size 118.8 KB, free 9.2 GB)
13/12/13 12:26:55 INFO mapred.FileInputFormat: Total input paths to process : 1
13/12/13 12:26:55 INFO spark.SparkContext: Starting job: first at GeneralizedLinearAlgorithm.scala:121
13/12/13 12:26:55 INFO scheduler.DAGScheduler: Got job 0 (first at GeneralizedLinearAlgorithm.scala:121) with 1 output partitions (allowLocal=true)
13/12/13 12:26:55 INFO scheduler.DAGScheduler: Final stage: Stage 0 (first at GeneralizedLinearAlgorithm.scala:121)
13/12/13 12:26:55 INFO scheduler.DAGScheduler: Parents of final stage: List()
13/12/13 12:26:55 INFO scheduler.DAGScheduler: Missing parents: List()
13/12/13 12:26:55 INFO scheduler.DAGScheduler: Computing the requested partition locally
13/12/13 12:26:55 INFO rdd.HadoopRDD: Input split: file:/data/tanmay/tmp/sample_svm_data.txt:0+39474
13/12/13 12:26:55 INFO scheduler.DAGScheduler: Failed to run first at GeneralizedLinearAlgorithm.scala:121
Exception in thread "main" java.lang.NumberFormatException: For input string: "1 0 2.52078447201548 0 0 0 2.004684436494304 2.000347299268466 0 2.228387042742021 2.228387042742023 0 0 0 0 0 0"
    at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1241)
    at java.lang.Double.parseDouble(Double.java:540)
    at scala.collection.immutable.StringLike$class.toDouble(StringLike.scala:234)
    at scala.collection.immutable.StringOps.toDouble(StringOps.scala:31)
    at org.apache.spark.mllib.util.MLUtils$$anonfun$loadLabeledData$1.apply(MLUtils.scala:45)
    at org.apache.spark.mllib.util.MLUtils$$anonfun$loadLabeledData$1.apply(MLUtils.scala:43)
    at scala.collection.Iterator$$anon$19.next(Iterator.scala:401)
    at scala.collection.Iterator$$anon$18.next(Iterator.scala:385)
    at scala.collection.Iterator$class.foreach(Iterator.scala:772)
    at scala.collection.Iterator$$anon$18.foreach(Iterator.scala:379)
    at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:102)
    at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:250)
    at scala.collection.Iterator$$anon$18.toBuffer(Iterator.scala:379)
    at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:237)
    at scala.collection.Iterator$$anon$18.toArray(Iterator.scala:379)
    at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:768)
    at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:768)
    at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:758)
    at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:758)
    at org.apache.spark.scheduler.DAGScheduler.runLocallyWithinThread(DAGScheduler.scala:484)
    at org.apache.spark.scheduler.DAGScheduler$$anon$2.run(DAGScheduler.scala:470)

Given that Apache Spark ships the "sample_svm_data.txt" file together with its machine learning library [which suggests that the data itself should be fine], could anyone help me pinpoint what exactly is wrong with this data (or with the input parameters)?


Answer:

The problem is that the data file uses only spaces as separators, a format the default loader used by the SVM example (MLUtils.loadLabeledData) cannot parse. I simply replaced the spaces with commas, and it worked!
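For illustration, here is a minimal Scala sketch of one way to do that conversion, assuming the loader expects lines of the form "<label>,<f1> <f2> ..." (label separated from the space-separated features by a comma). The file paths are placeholders, not the actual paths from the question.

    import java.io.PrintWriter
    import scala.io.Source

    // Hypothetical conversion: turn "label f1 f2 ..." lines into "label,f1 f2 ..."
    // so that the label can be split off from the feature vector.
    val inputPath  = "/path/to/sample_svm_data.txt"       // original, space-separated file
    val outputPath = "/path/to/sample_svm_data_comma.txt" // converted copy used for training

    val writer = new PrintWriter(outputPath)
    for (line <- Source.fromFile(inputPath).getLines() if line.trim.nonEmpty) {
      val tokens = line.trim.split("\\s+")
      writer.println(tokens.head + "," + tokens.tail.mkString(" "))
    }
    writer.close()

After converting, the same run-example command can be pointed at the converted file instead of the original one.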


Using Support Vector Machines in Apache Spark

I would like some insight into running support vector machines (SVMs) in Apache Spark.
When I use the run-example script provided in the Spark home directory with the argument org.apache.spark.mllib.classification.SVMWithSGD, it prints the following message: Usage: SVM <master> <input_dir> <step_size> <regularization_parameter> <niters>. I understand what <master>, <input_dir>, and <niters> mean.
Could you help me figure out what the remaining parameters mean, or at least point me to a tutorial that explains them?


Answer:

<step_size> is the starting value of the learning rate. For convergence, the step size should decrease over time; in SGD this is done by dividing the input value of step_size by the square root of the iteration number.
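As a purely illustrative Scala sketch of that decay schedule (the exact formula inside MLlib may differ slightly), dividing an initial step size of 2.0 by the square root of the iteration number gives:

    // Illustrative only: decay of an initial step size over SGD iterations,
    // assuming effective step = stepSize / sqrt(iteration).
    val stepSize = 2.0
    for (t <- 1 to 5) {
      val effectiveStep = stepSize / math.sqrt(t)
      println(f"iteration $t%d: effective step size = $effectiveStep%.4f")
    }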

<reg_param> is a scalar that tunes the strength of the constraints. Small values imply a soft margin, large values imply a hard margin, with infinity being the limit of a perfectly hard margin.
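Putting the two parameters together, here is a rough Scala sketch of how they map onto the MLlib API. The SparkContext setup, file path, and hyperparameter values are placeholders, and the input file is assumed to already be in the format MLUtils.loadLabeledData expects.

    import org.apache.spark.SparkContext
    import org.apache.spark.mllib.classification.SVMWithSGD
    import org.apache.spark.mllib.util.MLUtils

    // Sketch only: <step_size> and <regularization_parameter> become the
    // stepSize and regParam arguments of SVMWithSGD.train.
    val sc = new SparkContext("local", "SVMParamsSketch")
    val training = MLUtils.loadLabeledData(sc, "/path/to/sample_svm_data.txt")

    val numIterations = 100 // <niters>
    val stepSize = 1.0      // <step_size>: initial learning rate, shrunk over iterations
    val regParam = 0.01     // <regularization_parameter>: larger value => harder margin

    val model = SVMWithSGD.train(training, numIterations, stepSize, regParam, 1.0)
    println(s"Trained SVM with intercept ${model.intercept}")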
