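I already have an acceptable model, but I would like to improve it by tuning its parameters with CrossValidator and ParamGridBuilder in a Spark ML Pipeline.
As the Estimator I will use the existing pipeline. I do not know what to put in the ParamMaps; I do not understand that part. As the Evaluator I will use the RegressionEvaluator I have already created.
I plan to run 5-fold cross-validation and test a list of 10 different values for the depth of the tree.
How do I select and display the best model, the one with the lowest RMSE?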
Actual example:

    from pyspark.ml import Pipeline
    from pyspark.ml.regression import DecisionTreeRegressor
    from pyspark.ml.feature import VectorIndexer
    from pyspark.ml.evaluation import RegressionEvaluator

    dt = DecisionTreeRegressor()
    dt.setPredictionCol("Predicted_PE")
    dt.setMaxBins(100)
    dt.setFeaturesCol("features")
    dt.setLabelCol("PE")
    dt.setMaxDepth(8)

    # vectorizer and trainingSetDF are defined earlier (not shown)
    pipeline = Pipeline(stages=[vectorizer, dt])
    model = pipeline.fit(trainingSetDF)

    # the column names must match the ones set on dt above
    regEval = RegressionEvaluator(predictionCol="Predicted_PE", labelCol="PE", metricName="rmse")
    # predictions is the output of model.transform(...) on the evaluation data (not shown)
    rmse = regEval.evaluate(predictions)
    print("Root Mean Squared Error: %.2f" % rmse)

Output:

    Root Mean Squared Error: 3.60
What I need:

    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    dt2 = DecisionTreeRegressor()
    dt2.setPredictionCol("Predicted_PE")
    dt2.setMaxBins(100)
    dt2.setFeaturesCol("features")
    dt2.setLabelCol("PE")
    dt2.setMaxDepth(10)

    pipeline2 = Pipeline(stages=[vectorizer, dt2])
    model2 = pipeline2.fit(trainingSetDF)

    regEval2 = RegressionEvaluator(predictionCol="Predicted_PE", labelCol="PE", metricName="rmse")

    paramGrid = ParamGridBuilder().build()  # ??????
    crossval = CrossValidator(estimator=pipeline2, estimatorParamMaps=paramGrid, evaluator=regEval2, numFolds=5)  # ?????

    rmse2 = regEval2.evaluate(predictions)
    #bestPipeline = ????
    #bestLRModel = ????
    #bestParams = ????
    print("Root Mean Squared Error: %.2f" % rmse2)

Output:

    Root Mean Squared Error: 3.60  # the same ¿?
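Answer:
You need to call the .fit() method on your crossval object, passing in your training data, to create the CV model. That call is what actually runs the cross-validation. You can then get the best model (according to your evaluation metric) from it. For example: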

    cvModel = crossval.fit(trainingData)
    myBestModel = cvModel.bestModel
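
On the ParamMaps part of the question: an empty `ParamGridBuilder().build()` gives CrossValidator nothing to vary, so the grid needs at least one `addGrid(param, values)` call. Below is a minimal sketch of how the whole thing could fit together, not a definitive implementation. The function names `tune_max_depth` and `rank_by_metric` are made up for illustration, and the sketch assumes that `avgMetrics` on the fitted CrossValidatorModel is in the same order as the param grid.

```python
def rank_by_metric(param_maps, avg_metrics, larger_is_better=False):
    """Pair each parameter setting with its mean cross-validation metric, best first.

    Hypothetical helper for illustration; it is not part of pyspark.
    """
    pairs = list(zip(avg_metrics, param_maps))
    # Sort on the metric only. RMSE is "smaller is better", so ascending by default.
    return sorted(pairs, key=lambda pair: pair[0], reverse=larger_is_better)


def tune_max_depth(pipeline, tree, evaluator, training_df, depths, num_folds=5):
    """Cross-validate `pipeline` over candidate maxDepth values for `tree`.

    `tree` must be the same estimator object that sits inside `pipeline`.
    Returns the best fitted PipelineModel and the ranked (metric, ParamMap) list.
    """
    # Imported here so rank_by_metric stays usable without a Spark installation.
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    # One ParamMap per candidate depth, i.e. [{tree.maxDepth: d} for d in depths].
    param_grid = ParamGridBuilder().addGrid(tree.maxDepth, depths).build()

    crossval = CrossValidator(estimator=pipeline,
                              estimatorParamMaps=param_grid,
                              evaluator=evaluator,
                              numFolds=num_folds)

    # Fits len(depths) * num_folds models, then refits the best setting
    # on all of training_df; that refit is what bestModel holds.
    cv_model = crossval.fit(training_df)

    # Assumption: avgMetrics[i] is the mean metric across folds for param_grid[i].
    ranking = rank_by_metric(param_grid, cv_model.avgMetrics,
                             larger_is_better=evaluator.isLargerBetter())
    return cv_model.bestModel, ranking
```

With the names from the question, this could be called as `bestPipeline, ranking = tune_max_depth(pipeline2, dt2, regEval2, trainingSetDF, depths=list(range(2, 12)))`; the ten depths 2 through 11 are an arbitrary choice. `ranking[0]` then holds the lowest mean cross-validated RMSE together with the ParamMap that produced it, and `bestPipeline.transform(...)` gives predictions that can be passed to `regEval2.evaluate(...)`. Re-evaluating the old `predictions` DataFrame, as in the question, will keep returning the same 3.60 because those predictions came from the first model.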