Saving a Spark ML model to HDFS

I am trying to save a model object created with the Spark ML library.

However, this throws an error:

    Exception in thread "main" java.lang.NoSuchMethodError: org.apache.spark.ml.PipelineModel.save(Ljava/lang/String;)V
        at com.sf.prediction$.main(prediction.scala:61)
        at com.sf.prediction.main(prediction.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:672)
        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Here are my dependencies:

    <dependency>
        <groupId>org.scalatest</groupId>
        <artifactId>scalatest_2.10</artifactId>
        <version>2.1.7</version>
        <scope>test</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-shade-plugin</artifactId>
        <version>2.4.3</version>
        <type>maven-plugin</type>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.10</artifactId>
        <version>1.6.0</version>
    </dependency>
    <dependency>
        <groupId>org.scala-lang</groupId>
        <artifactId>scala-parser-combinators</artifactId>
        <version>2.11.0-M4</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.10</artifactId>
        <version>1.6.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.commons</groupId>
        <artifactId>commons-csv</artifactId>
        <version>1.2</version>
    </dependency>
    <dependency>
        <groupId>com.databricks</groupId>
        <artifactId>spark-csv_2.10</artifactId>
        <version>1.4.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-hive_2.10</artifactId>
        <version>1.6.1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-mllib_2.10</artifactId>
        <version>1.6.0</version>
    </dependency>

I also want to save the DataFrame produced by the model in CSV format.

    model.transform(df).select("features","label","prediction").show()

    import org.apache.spark.SparkConf
    import org.apache.spark.SparkContext
    import org.apache.spark.sql.SQLContext
    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.hive.HiveContext
    import org.apache.spark.ml.feature.OneHotEncoder
    import org.apache.spark.ml.feature.VectorAssembler
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.PipelineModel._
    import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorIndexer}
    import org.apache.spark.ml.util.MLWritable

    object prediction {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setMaster("local[2]")
          .setAppName("conversion")
        val sc = new SparkContext(conf)
        val hiveContext = new HiveContext(sc)

        val df = hiveContext.sql("select * from prediction_test")
        df.show()

        val credit_indexer = new StringIndexer().setInputCol("transaction_credit_card").setOutputCol("creditCardIndex").fit(df)
        val category_indexer = new StringIndexer().setInputCol("transaction_category").setOutputCol("categoryIndex").fit(df)
        val location_flag_indexer = new StringIndexer().setInputCol("location_flag").setOutputCol("locationIndex").fit(df)
        val label_indexer = new StringIndexer().setInputCol("fraud").setOutputCol("label").fit(df)

        val assembler = new VectorAssembler().setInputCols(Array("transaction_amount", "creditCardIndex", "categoryIndex", "locationIndex")).setOutputCol("features")
        val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01)
        val pipeline = new Pipeline().setStages(Array(credit_indexer, category_indexer, location_flag_indexer, label_indexer, assembler, lr))

        val model = pipeline.fit(df)
        pipeline.save("/user/f42h/prediction/pipeline")
        model.save("/user/f42h/prediction/model")
        //   val sameModel = PipelineModel.load("/user/bob/prediction/model")
        model.transform(df).select("features","label","prediction")
      }
    }

Answer:

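You are building against Spark 1.6.0. As far as I know, full save and load support for ML models only arrived in version 2.0; 1.6 shipped an initial, partial pipeline persistence API. You can try it out with the 2.0.0-preview artifacts: http://search.maven.org/#search%7Cga%7C1%7Cg%3Aorg.apache.spark%20v%3A2.0.0-preview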
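
A NoSuchMethodError is raised at runtime, when the class that was actually loaded lacks a method that was present at compile time. So the Spark version on the classpath of the running job is what matters, not only the version in the pom. A minimal sketch for checking this by reflection follows. The names `ClasspathCheck` and `hasMethod` are hypothetical helpers written for this illustration and are not part of Spark.

```scala
object ClasspathCheck {
  /** Returns true if `className` can be loaded from the current classpath and
    * exposes a public method `methodName` taking exactly `paramTypes`. */
  def hasMethod(className: String, methodName: String, paramTypes: Class[_]*): Boolean =
    try {
      // initialize = false: load the class without running its static initializers
      val cls = Class.forName(className, false, getClass.getClassLoader)
      cls.getMethod(methodName, paramTypes: _*)
      true
    } catch {
      case _: ClassNotFoundException | _: NoSuchMethodException => false
    }

  def main(args: Array[String]): Unit = {
    // The class and method named in the stack trace from the question
    val cls = "org.apache.spark.ml.PipelineModel"
    val available = hasMethod(cls, "save", classOf[String])
    println(s"$cls.save(String) available at runtime: $available")
  }
}
```

If this prints `false` when run through the same spark-submit as the failing job, the Spark installation executing the job does not provide `PipelineModel.save`, whatever the build declares. Printing `sc.version` from inside the job is another quick way to see which Spark version is actually running.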
