I am a Scala beginner. I am trying to create an object that accepts a ProbabilisticClassifier as input and yields a CrossValidator model:
```scala
import org.apache.spark.ml.classification.{ProbabilisticClassifier, ProbabilisticClassificationModel}
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import constants.Const

object MyModels {

  def loadOrCreateModel[A, M, T](
      model: ProbabilisticClassifier[Vector[T], A, M],
      paramGrid: Array[ParamMap]): CrossValidator = {
    // Binary evaluator.
    val binEvaluator = (
      new BinaryClassificationEvaluator()
        .setLabelCol("yCol")
    )
    // Cross validator.
    val cvModel = (
      new CrossValidator()
        .setEstimator(model)
        .setEvaluator(binEvaluator)
        .setEstimatorParamMaps(paramGrid)
        .setNumFolds(3)
    )
    cvModel
  }
}
```
But this produces the following error:
```
sbt package
[info] Loading project definition from somepath/project
[info] Loading settings from build.sbt ...
[info] Set current project to xxx (in build file:somepath/)
[info] Compiling 1 Scala source to somepath/target/scala-2.11/classes ...
[error] somepath/src/main/scala/models.scala:11:12: type arguments [Vector[T],A,M] do not conform to class ProbabilisticClassifier's type parameter bounds [FeaturesType,E <: org.apache.spark.ml.classification.ProbabilisticClassifier[FeaturesType,E,M],M <: org.apache.spark.ml.classification.ProbabilisticClassificationModel[FeaturesType,M]]
[error]     model: ProbabilisticClassifier[Vector[T], A, M],
[error]            ^
[error] one error found
[error] (Compile / compileIncremental) Compilation failed
[error] Total time: 3 s, completed Mar 31, 2018 4:22:31 PM
makefile:127: recipe for target 'target/scala-2.11/classes/models/XModels.class' failed
make: *** [target/scala-2.11/classes/models/XModels.class] Error 1
```
I have tried several combinations of the [A, M, T] parameters, as well as different types in the method arguments. My idea is to be able to pass either a LogisticRegression or a RandomForestClassifier to this function. From the documentation:
```scala
class LogisticRegression extends ProbabilisticClassifier[Vector, LogisticRegression, LogisticRegressionModel]
  with LogisticRegressionParams with DefaultParamsWritable with Logging

class RandomForestClassifier extends ProbabilisticClassifier[Vector, RandomForestClassifier, RandomForestClassificationModel]
  with RandomForestClassifierParams with DefaultParamsWritable
```
Could someone point me to resources I can study in order to implement this method? I am using Spark 2.1.0.
Edit 01

Thanks @Andrey Tyukin, and sorry that the code was not reproducible; it was actually a String. Your code does work, but perhaps I expressed myself poorly:
```
<console>:35: error: type mismatch;
 found   : org.apache.spark.ml.classification.LogisticRegression
 required: org.apache.spark.ml.classification.ProbabilisticClassifier[Vector[?],?,?]
       val cvModel = models.TalkingDataModels.loadOrCreateModel(logistic_regressor, paramGrid)
```
So maybe my idea was wrong from the start. Is it possible to create a single method that accepts either LogisticRegression or RandomForestClassifier objects?
Code edited to comply with MCVE standards:
```scala
import org.apache.spark.ml.classification.{ProbabilisticClassifier, ProbabilisticClassificationModel}
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import org.apache.spark.ml.classification.LogisticRegression

object MyModels {

  def main(array: Array[String]): Unit = {
    val logisticRegressor = (
      new LogisticRegression()
        .setFeaturesCol("yCol")
        .setLabelCol("labels")
        .setMaxIter(10)
    )
    val paramGrid = (
      new ParamGridBuilder()
        .addGrid(logisticRegressor.regParam, Array(0.01, 0.1, 1))
        .build()
    )
    loadOrCreateModel(logisticRegressor, paramGrid)
    println()
  }

  def loadOrCreateModel[
      F,
      M <: ProbabilisticClassificationModel[Vector[F], M],
      P <: ProbabilisticClassifier[Vector[F], P, M]](
      probClassif: ProbabilisticClassifier[Vector[F], P, M],
      paramGrid: Array[ParamMap]): CrossValidator = {
    // Binary evaluator.
    val binEvaluator =
      new BinaryClassificationEvaluator()
        .setLabelCol("y")
    // Cross validator.
    val cvModel =
      new CrossValidator()
        .setEstimator(probClassif)
        .setEvaluator(binEvaluator)
        .setEstimatorParamMaps(paramGrid)
        .setNumFolds(3)
    cvModel
  }
}
```
Answer:

This code compiles, but I had to throw away your constants.Const.yColumn string and replace it with the magic value "y":
```scala
import org.apache.spark.ml.classification.{ProbabilisticClassifier, ProbabilisticClassificationModel}
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

object CrossValidationExample {

  def loadOrCreateModel[
      F,
      M <: ProbabilisticClassificationModel[Vector[F], M],
      P <: ProbabilisticClassifier[Vector[F], P, M]](
      probClassif: ProbabilisticClassifier[Vector[F], P, M],
      paramGrid: Array[ParamMap]): CrossValidator = {
    // Binary evaluator.
    val binEvaluator =
      new BinaryClassificationEvaluator()
        .setLabelCol("y")
    // Cross validator.
    val cvModel =
      new CrossValidator()
        .setEstimator(probClassif)
        .setEvaluator(binEvaluator)
        .setEstimatorParamMaps(paramGrid)
        .setNumFolds(3)
    cvModel
  }
}
```
Before defining the list of generic parameters, it might help to perform a topological sort in your head to figure out which parameters depend on which. Here, the model depends on the type of the features, and the probabilistic classifier depends on both the feature type and the model type. So it seems better to declare the parameters in the order: features, model, classifier. Then you have to get the F-bounded polymorphism right.
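The same dependency ordering and F-bounds can be seen in a minimal, Spark-free sketch (all names here are hypothetical, invented for illustration; only the shape of the bounds mirrors ProbabilisticClassifier):

```scala
// A model type that is F-bounded on itself, like
// ProbabilisticClassificationModel[FeaturesType, M <: ...].
trait Model[Feat, M <: Model[Feat, M]]

// A classifier knows its feature type, its own concrete type, and its model type,
// like ProbabilisticClassifier[FeaturesType, E <: ..., M <: ...].
trait Classifier[Feat, C <: Classifier[Feat, C, M], M <: Model[Feat, M]] {
  def fit(data: Seq[Feat]): M
}

// A concrete pair, analogous to LogisticRegression / LogisticRegressionModel.
final class DummyModel extends Model[Double, DummyModel]
final class DummyClassifier extends Classifier[Double, DummyClassifier, DummyModel] {
  def fit(data: Seq[Double]): DummyModel = new DummyModel
}

object FBoundedDemo {
  // Type parameters declared in dependency order: features, then model, then classifier.
  def train[
      F,
      M <: Model[F, M],
      C <: Classifier[F, C, M]](
      classifier: Classifier[F, C, M],
      data: Seq[F]): M =
    classifier.fit(data)

  def main(args: Array[String]): Unit = {
    // Type inference resolves F = Double, M = DummyModel, C = DummyClassifier.
    val m: DummyModel = train(new DummyClassifier, Seq(1.0, 2.0))
    println(m.getClass.getSimpleName)
  }
}
```

Because each type parameter only refers to parameters already constrained earlier in the list (plus itself, via the F-bound), the compiler can infer all three arguments from the concrete classifier passed in.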
Ah, and by the way, egyptian-brackets-style indentation seems to me the only sane way to indent multiple parameter lists, especially when the type parameters are fifty miles long (unfortunately, there is nothing you can do about the length of the type parameters; they tend to be that long in every machine learning library I have ever seen).
Edit (answer to the second MCVE part)

This is a rather straightforward generalization. If it wants linalg.Vector instead of Vector[Feature], then abstract over that too:
```scala
import org.apache.spark.ml.classification.{ProbabilisticClassifier, ProbabilisticClassificationModel}
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel}
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.linalg.{Vector => LinalgVector}

object CrossValidationExample {

  def main(array: Array[String]): Unit = {
    val logisticRegressor = (
      new LogisticRegression()
        .setFeaturesCol("yCol")
        .setLabelCol("labels")
        .setMaxIter(10)
    )
    val paramGrid = (
      new ParamGridBuilder()
        .addGrid(logisticRegressor.regParam, Array(0.01, 0.1, 1))
        .build()
    )
    loadOrCreateModel(logisticRegressor, paramGrid)

    val rfc: RandomForestClassifier = ???
    loadOrCreateModel(rfc, paramGrid)
  }

  def loadOrCreateModel[
      FeatVec,
      M <: ProbabilisticClassificationModel[FeatVec, M],
      P <: ProbabilisticClassifier[FeatVec, P, M]](
      probClassif: ProbabilisticClassifier[FeatVec, P, M],
      paramGrid: Array[ParamMap]): CrossValidator = {
    // Binary evaluator.
    val binEvaluator =
      new BinaryClassificationEvaluator()
        .setLabelCol("y")
    // Cross validator.
    val cvModel =
      new CrossValidator()
        .setEstimator(probClassif)
        .setEvaluator(binEvaluator)
        .setEstimatorParamMaps(paramGrid)
        .setNumFolds(3)
    cvModel
  }
}
```