How do I merge multiple feature vectors in a DataFrame?

IT Technology · xiaolong · April 8, 2025

Using Spark ML transformers, I arrived at a DataFrame in which each row looks like this:

Row(object_id, text_features_vector, color_features, type_features)

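Here text_features is a sparse vector of term weights, color_features is a 20-element one-hot-encoded dense vector of colors, and type_features is also a one-hot-encoded dense vector of types.

What is a good way, using Spark's own facilities, to merge these features into one large array so that I can measure the cosine distance between any two objects?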

Answer:

You can use VectorAssembler:

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.DataFrame

val df: DataFrame = ???

val assembler = new VectorAssembler()
  .setInputCols(Array("text_features", "color_features", "type_features"))
  .setOutputCol("features")

val transformed = assembler.transform(df)
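To make the snippet above easier to try out, here is a minimal sketch that replaces the `???` placeholder with toy data. Everything in it beyond the VectorAssembler call is an assumption made for illustration: the object name `AssembleExample`, the helper names `toyData` and `assemble`, the local SparkSession, and the made-up values. The vectors are also shortened (5 text terms, 3 colors, 2 types) rather than the 20-element color vector from the question.

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.{DataFrame, SparkSession}

object AssembleExample {
  // Hypothetical toy rows standing in for the real transformer output.
  // Sizes and values are invented: 5 text terms, 3 colors, 2 types.
  def toyData(spark: SparkSession): DataFrame =
    spark.createDataFrame(Seq(
      (1L, Vectors.sparse(5, Array(0, 3), Array(0.5, 1.2)),
        Vectors.dense(1.0, 0.0, 0.0), Vectors.dense(0.0, 1.0)),
      (2L, Vectors.sparse(5, Array(1), Array(2.0)),
        Vectors.dense(0.0, 1.0, 0.0), Vectors.dense(1.0, 0.0))
    )).toDF("object_id", "text_features", "color_features", "type_features")

  // Concatenates the three vector columns, in the listed order,
  // into a single "features" column.
  def assemble(df: DataFrame): DataFrame =
    new VectorAssembler()
      .setInputCols(Array("text_features", "color_features", "type_features"))
      .setOutputCol("features")
      .transform(df)

  def main(args: Array[String]): Unit = {
    // A local session is assumed here purely for experimentation.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("assemble-example")
      .getOrCreate()

    assemble(toyData(spark)).select("object_id", "features").show(false)
    spark.stop()
  }
}
```

With these toy sizes each assembled vector has 5 + 3 + 2 = 10 entries. VectorAssembler picks a sparse or dense representation for the output on its own, so the printed form may differ from row to row.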

For a PySpark example, see: Encode and assemble multiple features in PySpark
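The question also asks about cosine distance between two objects, which the answer does not cover. One possible approach, sketched below, is to compute it directly from two assembled vectors. The object name `CosineExample` and the functions `cosineSimilarity` and `cosineDistance` are hypothetical helpers, not part of Spark; only `Vectors.norm` and `Vector.toArray` are library calls. Note that this sketch densifies both vectors, which is an assumption of convenience and may be wasteful for large sparse text features.

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}

object CosineExample {
  // Hypothetical helper: cosine similarity of two equally sized vectors.
  // Converts both to dense arrays, so it is simple rather than sparse-aware.
  def cosineSimilarity(a: Vector, b: Vector): Double = {
    require(a.size == b.size, "vectors must have the same size")
    val dot = a.toArray.zip(b.toArray).map { case (x, y) => x * y }.sum
    val denom = Vectors.norm(a, 2.0) * Vectors.norm(b, 2.0)
    // Convention chosen here: an all-zero vector has similarity 0.
    if (denom == 0.0) 0.0 else dot / denom
  }

  // Cosine distance taken as 1 - cosine similarity.
  def cosineDistance(a: Vector, b: Vector): Double =
    1.0 - cosineSimilarity(a, b)
}
```

To apply it to the assembled column, one option is to wrap `cosineDistance` in a UDF and evaluate it on pairs of rows, for example after a self-join on the `features` column. Whether that scales depends on how many object pairs need comparing.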

apache-spark apache-spark-mllib apache-spark-sql machine-learning
