I have built a KMeans model. My results are stored in a PySpark DataFrame called transformed.

(a) How do I interpret the contents of transformed?

(b) How do I create one or more Pandas DataFrames from transformed that show summary statistics for each of the 13 features for each of the 14 clusters?
from pyspark.ml.clustering import KMeans

# Trains a k-means model.
kmeans = KMeans().setK(14).setSeed(1)
model = kmeans.fit(X_spark_scaled)  # Fits a model to the input dataset with optional parameters.
transformed = model.transform(X_spark_scaled).select("features", "prediction")  # X_spark_scaled is my PySpark DataFrame consisting of 13 features
transformed.show(5, truncate = False)

+------------------------------------------------------------------------------------------------------------------------------------+----------+
|features                                                                                                                             |prediction|
+------------------------------------------------------------------------------------------------------------------------------------+----------+
|(14,[4,5,7,8,9,13],[1.0,1.0,485014.0,0.25,2.0,1.0])                                                                                  |12        |
|(14,[2,7,8,9,12,13],[1.0,2401233.0,1.0,1.0,1.0,1.0])                                                                                 |2         |
|(14,[2,4,5,7,8,9,13],[0.3333333333333333,0.6666666666666666,0.6666666666666666,2429111.0,0.9166666666666666,1.3333333333333333,3.0])|2         |
|(14,[4,5,7,8,9,12,13],[1.0,1.0,2054748.0,0.15384615384615385,11.0,1.0,1.0])                                                          |11        |
|(14,[2,7,8,9,13],[1.0,43921.0,1.0,1.0,1.0])                                                                                          |1         |
+------------------------------------------------------------------------------------------------------------------------------------+----------+
only showing top 5 rows
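Regarding part (a): each row of transformed pairs a feature vector with the integer id (0 to 13) of the cluster the fitted model assigned it to. The (14,[indices],[values]) notation is Spark's SparseVector format: a length-14 vector in which only the listed indices are non-zero. A minimal sketch of unpacking the first row shown above into a dense array (the variable name sv is mine):

from pyspark.ml.linalg import Vectors

# First row shown above: length 14, non-zeros at indices 4, 5, 7, 8, 9 and 13.
sv = Vectors.sparse(14, [4, 5, 7, 8, 9, 13], [1.0, 1.0, 485014.0, 0.25, 2.0, 1.0])

print(sv.toArray().tolist())
# [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 485014.0, 0.25, 2.0, 0.0, 0.0, 0.0, 1.0]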
As an aside, I found from another SO post that I can map the features to their names as shown below. It would be nice to have summary statistics (mean, median, stdev, min, max) for each feature of each cluster in one or more Pandas DataFrames.
from itertools import chain

attr_list = [attr for attr in chain(*transformed.schema['features'].metadata['ml_attr']['attrs'].values())]
attr_list
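Each element of attr_list is a small dict of vector-slot metadata. A sketch (my own, assuming the usual 'idx' and 'name' keys in the ml_attr metadata) of turning it into a list of feature names ordered by vector position:

from itertools import chain

attrs = chain(*transformed.schema['features'].metadata['ml_attr']['attrs'].values())
# Sort by slot index so the names line up with positions in the feature vector.
feature_names = [a['name'] for a in sorted(attrs, key=lambda a: a['idx'])]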
As requested in the comments, here is a snapshot of the data with 2 records (I don't want to provide too many records, since there is proprietary information here):
+---------------------+------------------------+-----------------------+----------------------+----------------------+------------------------------+---------------------------------+------------+-------------------+--------------------+------------------------------------+--------------------------+-------------------------------+-----------------+--------------------+--------------------+
|device_type_robot_pct|device_type_smart_tv_pct|device_type_desktop_pct|device_type_tablet_pct|device_type_mobile_pct|device_type_mobile_persist_pct|visitors_seen_with_anonymiser_pct|ip_time_span| ip_weight|mean_ips_per_visitor|visitors_seen_with_multi_country_pct|international_visitors_pct|visitors_seen_with_multi_ua_pct|count_tuids_on_ip| features| scaledFeatures|
+---------------------+------------------------+-----------------------+----------------------+----------------------+------------------------------+---------------------------------+------------+-------------------+--------------------+------------------------------------+--------------------------+-------------------------------+-----------------+--------------------+--------------------+
| 0.0| 0.0| 0.0| 0.0| 1.0| 1.0| 0.0| 485014.0| 0.25| 2.0| 0.0| 0.0| 0.0| 1.0|(14,[4,5,7,8,9,13...|(14,[4,5,7,8,9,13...|
| 0.0| 0.0| 1.0| 0.0| 0.0| 0.0| 0.0| 2401233.0| 1.0| 1.0| 0.0| 0.0| 1.0| 1.0|(14,[2,7,8,9,12,1...|(14,[2,7,8,9,12,1...|
+---------------------+------------------------+-----------------------+----------------------+----------------------+------------------------------+---------------------------------+------------+-------------------+--------------------+------------------------------------+--------------------------+-------------------------------+-----------------+--------------------+--------------------+
Answer:
As Anony-Mousse has commented, (Py)Spark ML is indeed much more limited in functionality than scikit-learn or other similar packages, and such functionality is not trivial; nevertheless, here is a way to get what you want (cluster statistics):
spark.version
# u'2.2.0'

from pyspark.ml.clustering import KMeans
from pyspark.ml.linalg import Vectors

# Toy data - 5-dimensional features, including sparse vectors
df = spark.createDataFrame(
    [(Vectors.sparse(5,[(0, 164.0),(1,520.0)]), 1.0),
     (Vectors.dense([519.0,2723.0,0.0,3.0,4.0]), 1.0),
     (Vectors.sparse(5,[(0, 2868.0), (1, 928.0)]), 1.0),
     (Vectors.sparse(5,[(0, 57.0), (1, 2715.0)]), 0.0),
     (Vectors.dense([1241.0,2104.0,0.0,0.0,2.0]), 1.0)],
    ["features", "target"])

df.show()
# +--------------------+------+
# |            features|target|
# +--------------------+------+
# |(5,[0,1],[164.0,5...|   1.0|
# |[519.0,2723.0,0.0...|   1.0|
# |(5,[0,1],[2868.0,...|   1.0|
# |(5,[0,1],[57.0,27...|   0.0|
# |[1241.0,2104.0,0....|   1.0|
# +--------------------+------+

kmeans = KMeans(k=3, seed=1)
model = kmeans.fit(df.select('features'))
transformed = model.transform(df).select("features", "prediction")
transformed.show()
# +--------------------+----------+
# |            features|prediction|
# +--------------------+----------+
# |(5,[0,1],[164.0,5...|         1|
# |[519.0,2723.0,0.0...|         2|
# |(5,[0,1],[2868.0,...|         0|
# |(5,[0,1],[57.0,27...|         2|
# |[1241.0,2104.0,0....|         2|
# +--------------------+----------+
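From here, one possible way to get the per-cluster summary statistics asked for in (b) is to unpack the vector column into ordinary double columns, collect the result to pandas, and group by prediction. This is a sketch of my own on top of the toy transformed above, not code quoted from the answer; the UDF to_array and the column names f0..f4 are made-up names:

from pyspark.sql import functions as F
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, DoubleType

# Unpack the vector column into an array of doubles, then into one plain
# column per feature, so the result can be collected into pandas.
to_array = udf(lambda v: v.toArray().tolist(), ArrayType(DoubleType()))

n_features = 5   # 5 toy features here; 14 in the real data
exploded = (transformed
            .withColumn("arr", to_array("features"))
            .select("prediction",
                    *[F.col("arr")[i].alias("f{}".format(i)) for i in range(n_features)]))

# Collect to pandas and compute summary statistics per cluster;
# describe() gives count, mean, std, min, quartiles (50% = median) and max.
pdf = exploded.toPandas()
stats_by_cluster = pdf.groupby("prediction").describe()

Note that pandas describe() covers the statistics requested in the question (mean, median via the 50% percentile, standard deviation, min, max) for every feature in every cluster, all in a single multi-indexed DataFrame.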