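How do I pass both numerical and categorical features to RandomForestRegressor in Apache Spark MLlib?
I can handle numerical features or categorical features on their own, but I don't know how to use them together.
My current working code is below (it predicts using only the numerical features):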
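    import org.apache.spark.ml.Pipeline;
    import org.apache.spark.ml.PipelineModel;
    import org.apache.spark.ml.PipelineStage;
    import org.apache.spark.ml.feature.VectorAssembler;
    import org.apache.spark.ml.regression.RandomForestRegressor;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    String[] featureNumericalCols = new String[]{
        "squareM",
        "timeTimeToPragueCityCenter",
    };
    String[] featureStringCols = new String[]{ // not used yet
        "type",
        "floor",
        "disposition",
    };

    // Combine the numerical columns into a single "features" vector column.
    VectorAssembler assembler = new VectorAssembler()
        .setInputCols(featureNumericalCols)
        .setOutputCol("features");

    // Preview only: the pipeline below runs the assembler again on its own input.
    Dataset<Row> numericalData = assembler.transform(data);
    numericalData.show();

    RandomForestRegressor rf = new RandomForestRegressor()
        .setLabelCol("price")
        .setFeaturesCol("features");

    // Chain the assembler and the forest in a Pipeline.
    Pipeline pipeline = new Pipeline()
        .setStages(new PipelineStage[]{assembler, rf});

    // Train the model. This also runs the assembler.
    PipelineModel model = pipeline.fit(trainingData);

    // Make predictions.
    Dataset<Row> predictions = model.transform(testData);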
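Answer:
For anyone who runs into the same problem, here is the solution: index each string column with a StringIndexer, then assemble the resulting index columns together with the numerical columns.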
    import org.apache.spark.ml.feature.StringIndexer;
    import org.apache.spark.ml.feature.VectorAssembler;
    import org.apache.spark.ml.regression.RandomForestRegressor;

    // Convert each categorical string column into a numeric index column.
    StringIndexer typeIndexer = new StringIndexer()
        .setInputCol("type")
        .setOutputCol("typeIndex");
    preparedData = typeIndexer.fit(preparedData).transform(preparedData);

    StringIndexer floorIndexer = new StringIndexer()
        .setInputCol("floor")
        .setOutputCol("floorIndex");
    preparedData = floorIndexer.fit(preparedData).transform(preparedData);

    StringIndexer dispositionIndexer = new StringIndexer()
        .setInputCol("disposition")
        .setOutputCol("dispositionIndex");
    preparedData = dispositionIndexer.fit(preparedData).transform(preparedData);

    // Assemble the numerical columns and the new index columns into one vector.
    String[] featureCols = new String[]{
        "squareM",
        "timeTimeToPragueCityCenter",
        "typeIndex",
        "floorIndex",
        "dispositionIndex"
    };
    VectorAssembler assembler = new VectorAssembler()
        .setInputCols(featureCols)
        .setOutputCol("features");
    preparedData = assembler.transform(preparedData);

    // ... some more implementation details

    RandomForestRegressor rf = new RandomForestRegressor()
        .setLabelCol("price")
        .setFeaturesCol("features");
    return rf.fit(preparedData);
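To make it clearer what ends up in the index columns and in the "features" vector, here is a small plain-Java sketch that does not use Spark at all. It imitates the two steps on in-memory data: labels are ranked by descending frequency (StringIndexer's default ordering), and the resulting index is appended to the numerical values. The class name IndexAndAssembleSketch, the methods fitIndex and assemble, and the sample values are made up for illustration. The alphabetical tie-break is an assumption of this sketch, and none of this is Spark's actual implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of Spark.
public class IndexAndAssembleSketch {

    // Map each distinct label to an index: the most frequent label gets 0.0.
    // Ties are broken alphabetically (an assumption of this sketch).
    public static Map<String, Double> fitIndex(List<String> labels) {
        Map<String, Integer> counts = new HashMap<>();
        for (String label : labels) {
            counts.merge(label, 1, Integer::sum);
        }
        List<String> ordered = new ArrayList<>(counts.keySet());
        ordered.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        Map<String, Double> index = new HashMap<>();
        for (int i = 0; i < ordered.size(); i++) {
            index.put(ordered.get(i), (double) i);
        }
        return index;
    }

    // Append the index values to the numerical values to form one feature row.
    public static double[] assemble(double[] numeric, double... indexed) {
        double[] features = Arrays.copyOf(numeric, numeric.length + indexed.length);
        System.arraycopy(indexed, 0, features, numeric.length, indexed.length);
        return features;
    }

    public static void main(String[] args) {
        // Made-up sample values for a "type" column.
        List<String> types = Arrays.asList("flat", "house", "flat", "studio", "flat", "house");
        Map<String, Double> typeIndex = fitIndex(types);

        // One row: squareM = 54.0, timeTimeToPragueCityCenter = 25.0, type = "house".
        double houseIndex = typeIndex.get("house");
        double[] row = assemble(new double[]{54.0, 25.0}, houseIndex);
        System.out.println(Arrays.toString(row)); // prints [54.0, 25.0, 1.0]
    }
}
```

Here "flat" occurs three times, "house" twice and "studio" once, so they map to 0.0, 1.0 and 2.0. The random forest then receives a single numeric vector per row, which is why the categorical columns have to be indexed before they are assembled.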