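I have a dataset: a distance matrix for 1,000 homologous protein sequences.
I have already computed the affinity matrix (in my case this is simply 1 - distance).
Viewed in Excel, the data has no header row; the first column holds the sequence names, and the next 1,000 columns hold the distance values.
I adapted the example code from the sklearn Affinity Propagation page. Here is my current code: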

    print(__doc__)

    import numpy as np
    from sklearn.cluster import AffinityPropagation
    from sklearn import metrics
    from sklearn.datasets import make_blobs
    import csv

    ##############################################################################
    # Read the CSV: first column is the sequence name, the rest are distances
    f = open('ha-sequences-sample-distmat2.csv', newline='')
    csvreader = csv.reader(f)
    sequence_names = []
    distance_matrix = []
    full_data = []
    for row in csvreader:
        # print(row)
        sequence_names.append(row[0])
        distance_matrix.append(row[1:])
        full_data.append(row)
    f.close()

    distmat = np.array([row for row in distance_matrix]).astype(float)
    # print(distmat)
    affinity_matrix = np.array([1 - row for row in distmat]).astype(float)
    full_matrix = list(zip(sequence_names, affinity_matrix))
    # print(affinity_matrix, sequence_names)

    ##############################################################################
    # Compute Affinity Propagation
    af = AffinityPropagation(affinity='precomputed').fit(affinity_matrix)
    cluster_centers_indices = af.cluster_centers_indices_
    labels = af.labels_
    n_clusters_ = len(cluster_centers_indices)

    print('Estimated number of clusters: %d' % n_clusters_)
    print("Homogeneity: %0.3f" % metrics.homogeneity_score(sequence_names, labels))
    print("Completeness: %0.3f" % metrics.completeness_score(sequence_names, labels))
    print("V-measure: %0.3f" % metrics.v_measure_score(sequence_names, labels))
    print("Adjusted Rand Index: %0.3f"
          % metrics.adjusted_rand_score(sequence_names, labels))
    print("Adjusted Mutual Information: %0.3f"
          % metrics.adjusted_mutual_info_score(sequence_names, labels))
    print("Silhouette Coefficient: %0.3f"
          % metrics.silhouette_score(affinity_matrix, labels, metric='sqeuclidean'))

    ##############################################################################
    # Plot result
    import pylab as pl
    from itertools import cycle

    pl.close('all')
    pl.figure(1)
    pl.clf()

    colors = cycle('bgrcmykbgrcmykbgrcmykbgrcmyk')
    for k, col in zip(range(n_clusters_), colors):
        class_members = labels == k
        cluster_center = affinity_matrix[cluster_centers_indices[k]]
        pl.plot(affinity_matrix[class_members, 0],
                affinity_matrix[class_members, 1], col + '.')
        pl.plot(cluster_center[0], cluster_center[1], 'o', markerfacecolor=col,
                markeredgecolor='k', markersize=14)
        for x in affinity_matrix[class_members]:
            pl.plot([cluster_center[0], x[0]], [cluster_center[1], x[1]], col)

    pl.title('Estimated number of clusters: %d' % n_clusters_)
    pl.show()

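The problem: I can't figure out how to output the sequence names that belong to each cluster. Ideally I would print the sequences that cluster together in the shell and show the cluster numbers on the plot, but I would be fine even with nothing shown on the plot.
Does anyone know how to do this?
Answer:
You have a list of sequence names (sequence_names) and an array of cluster labels (af.labels_). So you can iterate over the label array and build a mapping from each cluster label to a list of sequence names. For example: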

    # For simplicity, assume the names and cluster labels are predefined
    sequence_names = ["a", "b", "c", "d"]
    labels = [0, 1, 1, 0]

    from collections import defaultdict

    clusternames = defaultdict(list)
    for i, label in enumerate(labels):
        clusternames[label].append(sequence_names[i])

    # clusternames now holds the mapping from cluster label to a list of sequence names
    # Print each label with its list of sequence names
    for k, v in clusternames.items():
        print(k, v)
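The same mapping can also report which sequence is each cluster's exemplar. The sketch below is only an illustration: `group_names_by_cluster` is a hypothetical helper, and the three lists are stand-in values for `sequence_names`, `af.labels_` and `af.cluster_centers_indices_` from the question. It assumes the fit converged, so that label k refers to entry k of the center indices, which is how the question's plotting code already uses them.

```python
from collections import defaultdict


def group_names_by_cluster(sequence_names, labels, center_indices):
    """Return {cluster label: {"exemplar": name, "members": [names]}}.

    Hypothetical helper: labels and center_indices stand in for
    af.labels_ and af.cluster_centers_indices_.
    """
    members = defaultdict(list)
    for name, label in zip(sequence_names, labels):
        # int() turns NumPy integer labels into plain Python ints
        members[int(label)].append(name)
    return {
        label: {
            # Label k is assumed to index entry k of center_indices
            "exemplar": sequence_names[int(center_indices[label])],
            "members": names,
        }
        for label, names in sorted(members.items())
    }


# Stand-in values (assumptions); in the question's script these would be
# sequence_names, af.labels_ and af.cluster_centers_indices_.
sequence_names = ["seqA", "seqB", "seqC", "seqD"]
labels = [0, 1, 1, 0]
center_indices = [0, 2]

clusters = group_names_by_cluster(sequence_names, labels, center_indices)
for label, info in clusters.items():
    print("Cluster %d (exemplar %s): %s"
          % (label, info["exemplar"], ", ".join(info["members"])))
```

In the original script the call would be `group_names_by_cluster(sequence_names, af.labels_, af.cluster_centers_indices_)`, placed after the `fit` step.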