I want to use a color map I created (a dictionary of the form {leaf: color}) to color my clusters.
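I tried following https://joernhees.de/blog/2015/08/26/scipy-hierarchical-clustering-and-dendrogram-tutorial/, but for some reason the colors got mixed up. The default plot looks fine; I just want to assign those colors differently. I noticed there is a link_color_func parameter, but when I passed it my color map (the D_leaf_colors dictionary) I got an error because a dictionary is not a function. I created D_leaf_colors to customize the colors of the leaves that belong to particular clusters. In my actual dataset the colors carry meaning, so I am avoiding arbitrary color assignments.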
I don't want to use color_threshold because my actual data has many more clusters and SciPy repeats colors, hence this question...
How can I use my leaf color dictionary to customize the colors of my dendrogram clusters?
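I opened an issue on GitHub, https://github.com/scipy/scipy/issues/6346, where I elaborated on the approach to coloring the leaves (see "Interpreting the output of SciPy's hierarchical clustering dendrogram? (maybe found a bug...)"), but I still can't figure out how to actually do either of the following: (i) use the dendrogram output to reconstruct my dendrogram with my specified color dictionary, or (ii) reformat my D_leaf_colors dictionary so that it fits the link_color_func parameter.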
# Init
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()

# Load data
from sklearn.datasets import load_diabetes

# Clustering
from scipy.cluster.hierarchy import dendrogram, fcluster, leaves_list
from scipy.spatial import distance
from fastcluster import linkage  # You can use the SciPy one too

# IPython/Jupyter magic; remove when running as a plain script
%matplotlib inline

# Dataset
A_data = load_diabetes().data
DF_diabetes = pd.DataFrame(A_data, columns=["attr_%d" % j for j in range(A_data.shape[1])])

# Absolute value of correlation matrix, then subtract from 1 for dissimilarity
DF_dism = 1 - np.abs(DF_diabetes.corr())

# Compute average linkage
# (.values replaces DataFrame.as_matrix(), which was removed in pandas 1.0)
A_dist = distance.squareform(DF_dism.values)
Z = linkage(A_dist, method="average")

# Color mapping
D_leaf_colors = {"attr_1": "#808080",  # Unclustered gray

                 "attr_4": "#B061FF",  # Cluster 1 indigo
                 "attr_5": "#B061FF",
                 "attr_2": "#B061FF",
                 "attr_8": "#B061FF",
                 "attr_6": "#B061FF",
                 "attr_7": "#B061FF",

                 "attr_0": "#61ffff",  # Cluster 2 cyan
                 "attr_3": "#61ffff",
                 "attr_9": "#61ffff",
                 }

# Dendrogram
# To get this dendrogram coloring below `color_threshold=0.7`
D = dendrogram(Z=Z, labels=DF_dism.index, color_threshold=None,
               leaf_font_size=12, leaf_rotation=45,
               link_color_func=D_leaf_colors)
# TypeError: 'dict' object is not callable
I also tried "how do I get the subtrees of dendrogram made by scipy.cluster.hierarchy".
Answer:
Here is a solution that uses the matrix Z returned by linkage() (described in the documentation, though somewhat hidden) together with link_color_func:
# see question for code prior to "color mapping"

# Color mapping
dflt_col = "#808080"  # Unclustered gray
D_leaf_colors = {"attr_1": dflt_col,

                 "attr_4": "#B061FF",  # Cluster 1 indigo
                 "attr_5": "#B061FF",
                 "attr_2": "#B061FF",
                 "attr_8": "#B061FF",
                 "attr_6": "#B061FF",
                 "attr_7": "#B061FF",

                 "attr_0": "#61ffff",  # Cluster 2 cyan
                 "attr_3": "#61ffff",
                 "attr_9": "#61ffff",
                 }

# notes:
# * rows in Z correspond to "inverted U" links that connect clusters
# * rows are ordered by increasing distance
# * if the colors of the connected clusters match, use that color for link
link_cols = {}
for i, i12 in enumerate(Z[:, :2].astype(int)):
    # ids 0..len(Z) are leaves; larger ids are links created by earlier rows
    c1, c2 = (link_cols[x] if x > len(Z) else D_leaf_colors["attr_%d" % x]
              for x in i12)
    link_cols[i + 1 + len(Z)] = c1 if c1 == c2 else dflt_col

# Dendrogram
D = dendrogram(Z=Z, labels=DF_dism.index, color_threshold=None,
               leaf_font_size=12, leaf_rotation=45,
               link_color_func=lambda x: link_cols[x])
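The loop above can also be wrapped in a small helper that does not depend on the "attr_%d" naming. The following is only a sketch: build_link_colors and its parameter names are hypothetical, not part of SciPy. It relies on the linkage convention that ids 0..n-1 are the original leaves and that row i of Z creates cluster id n + i. It is shown here on a tiny hand-written linkage matrix so it runs without SciPy installed.

```python
def build_link_colors(Z, leaf_colors, labels, default="#808080"):
    """Map each link id (as passed to link_color_func) to a color.

    Hypothetical helper, not a SciPy function.
    Z           : linkage matrix (any sequence of rows; only columns 0 and 1 are read)
    leaf_colors : dict {leaf label: color}
    labels      : leaf labels, in the order of the original observations
    default     : color for links joining differently colored clusters,
                  and for leaves missing from leaf_colors
    """
    n_leaves = len(Z) + 1  # a linkage matrix has n - 1 rows for n leaves
    link_cols = {}
    for i, row in enumerate(Z):
        child_colors = []
        for child in (int(row[0]), int(row[1])):
            if child < n_leaves:
                # ids 0 .. n-1 are original leaves: look up the label's color
                child_colors.append(leaf_colors.get(labels[child], default))
            else:
                # ids >= n are clusters formed by earlier rows of Z
                child_colors.append(link_cols[child])
        c1, c2 = child_colors
        # keep the color only when both children agree
        link_cols[n_leaves + i] = c1 if c1 == c2 else default
    return link_cols


# Tiny hand-written linkage matrix for 4 leaves
# (columns: child id, child id, distance, cluster size)
Z_demo = [
    [0, 1, 0.1, 2],  # link 4 joins leaves a and b
    [2, 3, 0.2, 2],  # link 5 joins leaves c and d
    [4, 5, 0.9, 4],  # link 6 joins links 4 and 5
]
labels_demo = ["a", "b", "c", "d"]
colors_demo = {"a": "#B061FF", "b": "#B061FF", "c": "#61ffff", "d": "#61ffff"}

link_cols_demo = build_link_colors(Z_demo, colors_demo, labels_demo)
print(link_cols_demo)  # {4: '#B061FF', 5: '#61ffff', 6: '#808080'}
```

With the data from the question, this would presumably be called as link_cols = build_link_colors(Z, D_leaf_colors, DF_dism.index) and handed to dendrogram via link_color_func=lambda k: link_cols[k]. This is also why passing the dictionary directly cannot work: link_color_func is called with integer link ids, not with leaf labels.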