I have a bunch of texts that have been classified into categories; each document is then tagged 0, 1, or 2, with a probability assigned to each label.

[ "this is a foo bar", "bar bar black sheep", "sheep is an animal", "foo foo bar bar", "bar bar sheep sheep" ]

The previous tool in the pipeline returns a list of lists of tuples, where each element of the outer list corresponds to one document. All I can work with is the fact that I know each document is tagged 0, 1, or 2 along with the corresponding probabilities, like this:
[ [(0,0.3), (1,0.5), (2,0.1)], [(0,0.5), (1,0.3), (2,0.3)], [(0,0.4), (1,0.4), (2,0.5)], [(0,0.3), (1,0.7), (2,0.2)], [(0,0.2), (1,0.6), (2,0.1)] ]
I need to look at which label in each list of tuples has the highest probability and produce the following result:
[ [[(0,0.5), (1,0.3), (2,0.3)], [(0,0.4), (1,0.4), (2,0.5)]] , [[(0,0.3), (1,0.7), (2,0.2)], [(0,0.2), (1,0.6), (2,0.1)]] , [[(0,0.4), (1,0.4), (2,0.5)]] ]
Another example:

[in]:
[ [(0,0.7), (1,0.2), (2,0.4)], [(0,0.5), (1,0.9), (2,0.3)], [(0,0.3), (1,0.8), (2,0.4)], [(0,0.8), (1,0.2), (2,0.2)], [(0,0.1), (1,0.7), (2,0.5)] ]
[out]:
[[[(0,0.7), (1,0.2), (2,0.4)], [(0,0.8), (1,0.2), (2,0.2)]] , [[(0,0.5), (1,0.9), (2,0.3)], [(0,0.1), (1,0.7), (2,0.5)], [(0,0.3), (1,0.8), (2,0.4)]] , []]
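To make the intended rule concrete (judging from this second example): each document goes into the cluster of its highest-probability label, and a cluster may end up empty. A minimal pure-Python sketch of that grouping; the helper name group_by_best_label is just for illustration, and documents are kept in input order within each cluster:

N_TYPES = 3

def group_by_best_label(docs):
    # One bucket per label; a bucket may stay empty, as in the second example.
    clusters = [[] for _ in range(N_TYPES)]
    for doc in docs:
        # Pick the (label, probability) tuple with the highest probability,
        # then file the whole document under that label.
        best_label = max(doc, key=lambda pair: pair[1])[0]
        clusters[best_label].append(doc)
    return clusters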
Note: by the time the data reaches my part of the pipeline, I no longer have access to the original text.

How can I cluster a list of tuples of labels and probabilities? Is there anything in numpy, scipy, sklearn, or any other Python machine learning suite that can do this? Would even NLTK work?

Assume that the number of clusters is fixed, but not the cluster sizes.

I have only tried finding the maximum for the centroids, but that only gives me the top element of each cluster:
instream = [ [(0,0.3), (1,0.5), (2,0.1)], [(0,0.5), (1,0.3), (2,0.3)],
             [(0,0.4), (1,0.4), (2,0.5)], [(0,0.3), (1,0.7), (2,0.2)],
             [(0,0.2), (1,0.6), (2,0.1)] ]

# Find the centroid value for each label: the highest (label, probability)
# tuple, which for a fixed label means the highest probability.
c1_centroid_value = sorted([i[0] for i in instream], reverse=True)[0]
c2_centroid_value = sorted([i[1] for i in instream], reverse=True)[0]
c3_centroid_value = sorted([i[2] for i in instream], reverse=True)[0]

# Find the index of the document holding each centroid value.
c1_centroid = [i for i, j in enumerate(instream) if j[0] == c1_centroid_value][0]
c2_centroid = [i for i, j in enumerate(instream) if j[1] == c2_centroid_value][0]
c3_centroid = [i for i, j in enumerate(instream) if j[2] == c3_centroid_value][0]

print(instream[c1_centroid])
print(instream[c2_centroid])
print(instream[c3_centroid])
[out] (the top element of each cluster):

[(0, 0.5), (1, 0.3), (2, 0.3)]
[(0, 0.3), (1, 0.7), (2, 0.2)]
[(0, 0.4), (1, 0.4), (2, 0.5)]
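As an aside, the same three centroid lookups can be written more compactly with max and a key function; this is just a sketch of the identical logic over instream:

# instream[i][k] is the (label, probability) tuple for label k of document i;
# [1] selects the probability, so max over the document indices yields
# the index of the document with the highest probability for that label.
c1_centroid = max(range(len(instream)), key=lambda i: instream[i][0][1])
c2_centroid = max(range(len(instream)), key=lambda i: instream[i][1][1])
c3_centroid = max(range(len(instream)), key=lambda i: instream[i][2][1])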
Answer:

If I understand you correctly, this is what you want.
import numpy as np

N_TYPES = 3

instream = [ [(0,0.3), (1,0.5), (2,0.1)], [(0,0.5), (1,0.3), (2,0.3)],
             [(0,0.4), (1,0.4), (2,0.5)], [(0,0.3), (1,0.7), (2,0.2)],
             [(0,0.2), (1,0.6), (2,0.1)] ]
instream = np.array(instream)

# Strip the document labels, since only the probabilities matter here.
values = [[x[1] for x in doc] for doc in instream]

# Assign each document to a cluster via its maximum probability.
belongs_to = np.array([np.argmax(v) for v in values])

# Build, for each cluster, the array of indices into instream.
cluster_indices = [(belongs_to == k).nonzero()[0] for k in range(N_TYPES)]

# Apply the indices to get the full output.
out = [instream[cluster_indices[k]].tolist() for k in range(N_TYPES)]
The output out is:

[[[[0.0, 0.5], [1.0, 0.3], [2.0, 0.3]]],
 [[[0.0, 0.3], [1.0, 0.5], [2.0, 0.1]], [[0.0, 0.3], [1.0, 0.7], [2.0, 0.2]], [[0.0, 0.2], [1.0, 0.6], [2.0, 0.1]]],
 [[[0.0, 0.4], [1.0, 0.4], [2.0, 0.5]]]]
I used numpy arrays because they allow convenient searching and indexing. For example, the expression (belongs_to == 1).nonzero()[0] returns the array of indices into belongs_to whose value is 1, and instream[cluster_indices[2]] is an example of indexing with such an array. Note that round-tripping through np.array and tolist() is also why the (label, probability) pairs come back as lists of floats, e.g. [0.0, 0.5] instead of (0, 0.5).
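A quick demonstration of those two operations, using the belongs_to values that the example data produces (the per-document argmax gives 1, 0, 2, 1, 1):

import numpy as np

belongs_to = np.array([1, 0, 2, 1, 1])

# Indices of all documents assigned to cluster 1: prints [0 3 4]
print((belongs_to == 1).nonzero()[0])

# Build the index arrays for all three clusters, then pick out cluster 2.
cluster_indices = [(belongs_to == k).nonzero()[0] for k in range(3)]
print(cluster_indices[2])  # prints [2]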