Machine learning with Spark: performance of the data preparation stage, MLeap

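I have come across a lot of positive feedback about MLeap, a library for fast scoring. It runs on a model that has been converted into an MLeap bundle.

But what about the data preparation stage that comes before scoring?

Is there an effective way to convert a 'Spark ML data preparation pipeline' (which runs on the Spark framework during training) into efficient, performance-optimized bytecode?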

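Answer:

You can easily serialize your entire PipelineModel (feature engineering plus the trained model) with MLeap. Because the feature engineering stages are part of the serialized pipeline, the data preparation runs inside the MLeap runtime at scoring time as well.

Note: the code below is somewhat old, and a cleaner API may be available now.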

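// Spark side: imports for serialization (`managed` comes from scala-arm)
import ml.combust.bundle.BundleFile
import ml.combust.mleap.spark.SparkSupport._
import org.apache.spark.ml.bundle.SparkBundleContext
import resource._

// MLeap PipelineModel serialization into a single .zip file
val sparkBundleContext = SparkBundleContext().withDataset(pipelineModel.transform(trainData))
for (bundleFile <- managed(BundleFile(s"jar:file:${mleapSerializedPipelineModel}"))) {
  pipelineModel.writeBundle.save(bundleFile)(sparkBundleContext).get
}

// MLeap side: deserialize the model from the local filesystem (without any Spark dependency)
import ml.combust.bundle.BundleFile
import ml.combust.mleap.runtime.MleapSupport._
import resource._

val mleapPipeline = (for (bf <- managed(BundleFile(s"jar:file:${modelPath}"))) yield {
  bf.loadMleapBundle().get.root
}).tried.get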
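To illustrate how the data preparation then runs outside Spark, here is a minimal sketch of scoring one record with the deserialized pipeline. The object name `ScoreWithMleap`, the bundle path, and the input columns `country` and `age` are hypothetical; the real input schema has to match whatever raw columns your pipeline was trained on. The package layout (`ml.combust.mleap.runtime.frame`) and the scala-arm `managed` helper are assumed to match the MLeap version used above, and may differ in other releases.

```scala
import ml.combust.bundle.BundleFile
import ml.combust.mleap.core.types.{ScalarType, StructField, StructType}
import ml.combust.mleap.runtime.MleapSupport._
import ml.combust.mleap.runtime.frame.{DefaultLeapFrame, Row}
import resource._

object ScoreWithMleap {
  // Hypothetical raw input columns: they must match the columns the
  // Spark pipeline expected before any feature engineering was applied.
  val inputSchema: StructType = StructType(
    StructField("country", ScalarType.String),
    StructField("age", ScalarType.Double)
  ).get

  // Loads the bundle and pushes one raw record through every stage,
  // i.e. feature engineering followed by the model, with no SparkContext.
  def score(modelPath: String, country: String, age: Double): DefaultLeapFrame = {
    val mleapPipeline = (for (bf <- managed(BundleFile(s"jar:file:$modelPath"))) yield {
      bf.loadMleapBundle().get.root
    }).tried.get

    val frame = DefaultLeapFrame(inputSchema, Seq(Row(country, age)))
    mleapPipeline.transform(frame).get
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical location of the .zip bundle written by the Spark job.
    val scored = score("/tmp/pipeline-model.zip", "DE", 42.0)
    scored.dataset.foreach(println)
  }
}
```

In a real service you would load the bundle once at startup and reuse the transformer for every request, rather than reloading it per call as this sketch does for brevity.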

Keep in mind that if you have defined your own Estimators/Transformers in Spark, they will also need corresponding MLeap versions.
