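I built a K-means model with PySpark. Now I want to extract the cluster centers. How do I do that when the model is fitted inside a pipeline? Here is my current code, but it raises the error AttributeError: 'PipelineModel' object has no attribute 'clusterCenters'. How can I fix this?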
    #### model K-Means ###
    from pyspark.ml import Pipeline
    from pyspark.ml.clustering import KMeans, KMeansModel

    kmeans = KMeans() \
        .setK(3) \
        .setFeaturesCol("scaledFeatures") \
        .setPredictionCol("cluster")

    # Wrap the KMeans estimator in a Pipeline
    pipeline = Pipeline(stages=[kmeans])

    model = pipeline.fit(matrix_normalized)
    cluster = model.transform(matrix_normalized)

    # get cluster centers (this is the line that raises the AttributeError)
    centers = model.clusterCenters()
Answer:
Dummy data
    from pyspark.ml.linalg import Vectors
    from pyspark.ml.clustering import KMeans, KMeansModel
    from pyspark.ml.pipeline import Pipeline

    data = [(Vectors.dense([0.0, 0.0]),),
            (Vectors.dense([1.0, 1.0]),),
            (Vectors.dense([9.0, 8.0]),),
            (Vectors.dense([8.0, 9.0]),)]

    # `spark` is assumed to be an already active SparkSession
    matrix_normalized = spark.createDataFrame(data, ["scaledFeatures"])
Your code
    kmeans = KMeans() \
        .setK(3) \
        .setFeaturesCol("scaledFeatures") \
        .setPredictionCol("cluster")

    # Wrap the KMeans estimator in a Pipeline
    pipeline = Pipeline(stages=[kmeans])

    model = pipeline.fit(matrix_normalized)
    cluster = model.transform(matrix_normalized)
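Just change the last line. pipeline.fit() returns a PipelineModel, which has no clusterCenters() method of its own; the fitted KMeansModel is stored in its stages list, so call the method on that stage instead: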
    model.stages[0].clusterCenters()

    [array([0.5, 0.5]), array([8., 9.]), array([9., 8.])]
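If the pipeline later gains more stages (say, a scaler in front of KMeans), index 0 will no longer point at the KMeans model. Below is a minimal sketch of a lookup that does not depend on the stage position. It assumes only what is used above: a fitted PipelineModel with a stages list, and a fitted KMeans stage with a clusterCenters() method. The helper name get_cluster_centers is made up for this example and is not part of PySpark.

```python
def get_cluster_centers(pipeline_model):
    """Return the centers from the first fitted stage that has clusterCenters().

    Hypothetical helper, duck-typed on purpose: it relies only on the
    `stages` list of a fitted PipelineModel and on the `clusterCenters()`
    method of the fitted KMeans stage.
    """
    for stage in pipeline_model.stages:
        # Skip stages such as scalers or assemblers that have no centers.
        if callable(getattr(stage, "clusterCenters", None)):
            return stage.clusterCenters()
    raise ValueError("no stage in this pipeline exposes clusterCenters()")
```

With the model fitted above, get_cluster_centers(model) should give the same list as model.stages[0].clusterCenters(). If you would rather match on the class, checking isinstance(stage, KMeansModel) inside the loop is the stricter alternative.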