I am trying to test some code I found in the Spark documentation for handling categorical features with Java in Apache Spark:
SparkSession spark = SparkSession.builder()
        .master("local[4]")
        .appName("1-of-K encoding Test")
        .getOrCreate();

List<Row> data = Arrays.asList(
        RowFactory.create(0, "a"),
        RowFactory.create(1, "b"),
        RowFactory.create(2, "c"),
        RowFactory.create(3, "a"),
        RowFactory.create(4, "a"),
        RowFactory.create(5, "c"));

StructType schema = new StructType(new StructField[]{
        new StructField("id", DataTypes.IntegerType, false, Metadata.empty()),
        new StructField("category", DataTypes.StringType, false, Metadata.empty())
});

Dataset<Row> df = spark.createDataFrame(data, schema);

StringIndexerModel indexer = new StringIndexer()
        .setInputCol("category")
        .setOutputCol("categoryIndex")
        .fit(df);
But I am getting an error: the fit function cannot be called.
Do you have any ideas?
Answer:
Why are you creating the df in such a long-winded way? A more efficient approach is:
import org.apache.spark.ml.feature.StringIndexer
import sparkSession.implicits._

val df = sparkSession.sparkContext
  .parallelize(Seq((0, "a"), (1, "b"), (2, "c"), (3, "d"), (4, "e"), (5, "f")))
  .toDF("id", "category")

// fit() learns the string-to-index mapping; transform() applies it to the data
val newDf = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
  .fit(df)
  .transform(df)

newDf.show()
这会给出以下输出:
+---+--------+-------------+
| id|category|categoryIndex|
+---+--------+-------------+
|  0|       a|          2.0|
|  1|       b|          3.0|
|  2|       c|          4.0|
|  3|       d|          5.0|
|  4|       e|          0.0|
|  5|       f|          1.0|
+---+--------+-------------+
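Since the question uses Java, where the Scala toDF shortcut is not available (it relies on sparkSession.implicits._), createDataFrame with an explicit schema is the usual route. Below is a minimal, self-contained Java sketch of the same pipeline, assembled from the question's snippet; the class name StringIndexerExample and the trailing transform/show/stop calls are my additions, and it assumes the Spark 2.x+ spark-sql and spark-mllib artifacts are on the classpath:

import java.util.Arrays;
import java.util.List;

import org.apache.spark.ml.feature.StringIndexer;
import org.apache.spark.ml.feature.StringIndexerModel;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class StringIndexerExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .master("local[4]")
                .appName("1-of-K encoding Test")
                .getOrCreate();

        // Same toy data as the question: six rows with a string category column.
        List<Row> data = Arrays.asList(
                RowFactory.create(0, "a"),
                RowFactory.create(1, "b"),
                RowFactory.create(2, "c"),
                RowFactory.create(3, "a"),
                RowFactory.create(4, "a"),
                RowFactory.create(5, "c"));

        StructType schema = new StructType(new StructField[]{
                new StructField("id", DataTypes.IntegerType, false, Metadata.empty()),
                new StructField("category", DataTypes.StringType, false, Metadata.empty())
        });

        Dataset<Row> df = spark.createDataFrame(data, schema);

        // fit() learns the label-to-index mapping from the data;
        // transform() must then be called to actually add the indexed column.
        StringIndexerModel indexer = new StringIndexer()
                .setInputCol("category")
                .setOutputCol("categoryIndex")
                .fit(df);

        indexer.transform(df).show();

        spark.stop();
    }
}

Note that the exact index assigned to each label can differ from the Scala output above, since the data differs and ties in label frequency may be broken differently across Spark versions.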