PySpark ML: Getting KMeans cluster statistics

I have built a KMeans model. My results are stored in a PySpark DataFrame called transformed.

(a) How do I interpret the contents of transformed?

(b) How do I create one or more Pandas DataFrames from transformed that show summary statistics for the 13 features, for each of the 14 clusters?

from pyspark.ml.clustering import KMeans

# Trains a k-means model.
kmeans = KMeans().setK(14).setSeed(1)
# Fits a model to the input dataset with optional parameters.
model = kmeans.fit(X_spark_scaled)
# X_spark_scaled is my PySpark DataFrame, consisting of 13 features
transformed = model.transform(X_spark_scaled).select("features", "prediction")
transformed.show(5, truncate = False)

+------------------------------------------------------------------------------------------------------------------------------------+----------+
|features                                                                                                                            |prediction|
+------------------------------------------------------------------------------------------------------------------------------------+----------+
|(14,[4,5,7,8,9,13],[1.0,1.0,485014.0,0.25,2.0,1.0])                                                                                 |12        |
|(14,[2,7,8,9,12,13],[1.0,2401233.0,1.0,1.0,1.0,1.0])                                                                                |2         |
|(14,[2,4,5,7,8,9,13],[0.3333333333333333,0.6666666666666666,0.6666666666666666,2429111.0,0.9166666666666666,1.3333333333333333,3.0])|2         |
|(14,[4,5,7,8,9,12,13],[1.0,1.0,2054748.0,0.15384615384615385,11.0,1.0,1.0])                                                         |11        |
|(14,[2,7,8,9,13],[1.0,43921.0,1.0,1.0,1.0])                                                                                         |1         |
+------------------------------------------------------------------------------------------------------------------------------------+----------+
only showing top 5 rows

Incidentally, I found from another SO post that I can map the features to their names as shown below. It would be nice to have the summary statistics (mean, median, stddev, min, max) for each feature of each cluster in one or more Pandas DataFrames.

from itertools import chain

attr_list = [attr for attr in chain(*transformed.schema['features'].metadata['ml_attr']['attrs'].values())]
attr_list
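For reference, each entry of attr_list is (as far as I can tell) a dict with 'idx' and 'name' keys. A minimal sketch, with made-up column names, of turning it into an ordered list of feature names that lines up with the positions in each feature vector:

```python
# Assumed shape of the metadata entries: {'idx': <vector position>, 'name': <column name>}
attr_list = [{'idx': 1, 'name': 'ip_weight'}, {'idx': 0, 'name': 'ip_time_span'}]

# Sort by vector position so the names line up with the values in each vector
feature_names = [a['name'] for a in sorted(attr_list, key=lambda a: a['idx'])]
print(feature_names)  # ['ip_time_span', 'ip_weight']
```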

As requested in the comments, here is a snapshot of the data with 2 records (I don't want to provide too many records; there is proprietary information here):

+---------------------+------------------------+-----------------------+----------------------+----------------------+------------------------------+---------------------------------+------------+-------------------+--------------------+------------------------------------+--------------------------+-------------------------------+-----------------+--------------------+--------------------+
|device_type_robot_pct|device_type_smart_tv_pct|device_type_desktop_pct|device_type_tablet_pct|device_type_mobile_pct|device_type_mobile_persist_pct|visitors_seen_with_anonymiser_pct|ip_time_span|          ip_weight|mean_ips_per_visitor|visitors_seen_with_multi_country_pct|international_visitors_pct|visitors_seen_with_multi_ua_pct|count_tuids_on_ip|            features|      scaledFeatures|
+---------------------+------------------------+-----------------------+----------------------+----------------------+------------------------------+---------------------------------+------------+-------------------+--------------------+------------------------------------+--------------------------+-------------------------------+-----------------+--------------------+--------------------+
|                  0.0|                     0.0|                    0.0|                   0.0|                   1.0|                           1.0|                              0.0|    485014.0|               0.25|                 2.0|                                 0.0|                       0.0|                            0.0|              1.0|(14,[4,5,7,8,9,13...|(14,[4,5,7,8,9,13...|
|                  0.0|                     0.0|                    1.0|                   0.0|                   0.0|                           0.0|                              0.0|   2401233.0|                1.0|                 1.0|                                 0.0|                       0.0|                            1.0|              1.0|(14,[2,7,8,9,12,1...|(14,[2,7,8,9,12,1...|
+---------------------+------------------------+-----------------------+----------------------+----------------------+------------------------------+---------------------------------+------------+-------------------+--------------------+------------------------------------+--------------------------+-------------------------------+-----------------+--------------------+--------------------+

Answer:

As Anony-Mousse has commented, (Py)Spark ML is indeed much more limited than scikit-learn or other similar packages, and such functionality is not trivial; nevertheless, here is a way to get what you want (cluster statistics):

spark.version
# u'2.2.0'

from pyspark.ml.clustering import KMeans
from pyspark.ml.linalg import Vectors

# toy data - 5-dimensional features, including sparse vectors
df = spark.createDataFrame(
    [(Vectors.sparse(5,[(0, 164.0),(1,520.0)]), 1.0),
     (Vectors.dense([519.0,2723.0,0.0,3.0,4.0]), 1.0),
     (Vectors.sparse(5,[(0, 2868.0), (1, 928.0)]), 1.0),
     (Vectors.sparse(5,[(0, 57.0), (1, 2715.0)]), 0.0),
     (Vectors.dense([1241.0,2104.0,0.0,0.0,2.0]), 1.0)],
    ["features", "target"])

df.show()
# +--------------------+------+
# |            features|target|
# +--------------------+------+
# |(5,[0,1],[164.0,5...|   1.0|
# |[519.0,2723.0,0.0...|   1.0|
# |(5,[0,1],[2868.0,...|   1.0|
# |(5,[0,1],[57.0,27...|   0.0|
# |[1241.0,2104.0,0....|   1.0|
# +--------------------+------+

kmeans = KMeans(k=3, seed=1)
model = kmeans.fit(df.select('features'))
transformed = model.transform(df).select("features", "prediction")
transformed.show()
# +--------------------+----------+
# |            features|prediction|
# +--------------------+----------+
# |(5,[0,1],[164.0,5...|         1|
# |[519.0,2723.0,0.0...|         2|
# |(5,[0,1],[2868.0,...|         0|
# |(5,[0,1],[57.0,27...|         2|
# |[1241.0,2104.0,0....|         2|
# +--------------------+----------+
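Up to this point we have the cluster assignments. From here, one way to get the per-cluster summary statistics is to pull the data to the driver with toPandas() and let pandas describe() do the work; this assumes the data fits in driver memory, and cluster_stats and feature_names are names I am introducing here, not part of any Spark API:

```python
import pandas as pd

def cluster_stats(pdf, feature_names):
    """Per-cluster summary statistics, given a pandas DataFrame with a
    'features' column (one array-like of values per row) and a
    'prediction' column holding the cluster label."""
    # Expand the per-row feature arrays into one named column per feature
    feats = pd.DataFrame(pdf["features"].tolist(), columns=feature_names)
    feats["prediction"] = pdf["prediction"].values
    # describe() gives count, mean, std, min, max, and quartiles
    # (the 50% row is the median), i.e. everything asked for in (b)
    return {cluster: grp.drop(columns="prediction").describe()
            for cluster, grp in feats.groupby("prediction")}
```

With the Spark DataFrame above, something along the lines of transformed.toPandas(), followed by converting each ML vector to a plain array with its toArray() method, produces the expected input; the result is a dict mapping each cluster label to a pandas DataFrame of statistics for that cluster.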
