I have a bunch of texts that have been classified into categories; each document is then tagged 0, 1, or 2, with a probability assigned to each label.

[ "this is a foo bar", "bar bar black sheep", "sheep is an animal", "foo foo bar bar", "bar bar sheep sheep" ]

The previous tool in the pipeline returns a list of lists of tuples, where each element of the outer list corresponds to one document. All I can work with is the fact that I know each document is tagged 0, 1, or 2 along with the corresponding probabilities, like this:
[ [(0,0.3), (1,0.5), (2,0.1)], [(0,0.5), (1,0.3), (2,0.3)], [(0,0.4), (1,0.4), (2,0.5)], [(0,0.3), (1,0.7), (2,0.2)], [(0,0.2), (1,0.6), (2,0.1)] ]
I need to look at which label in each list of tuples has the highest probability and produce the following result:
[ [[(0,0.5), (1,0.3), (2,0.3)], [(0,0.4), (1,0.4), (2,0.5)]] , [[(0,0.3), (1,0.7), (2,0.2)], [(0,0.2), (1,0.6), (2,0.1)]] , [[(0,0.4), (1,0.4), (2,0.5)]] ]
Another example:

[in]:
[ [(0,0.7), (1,0.2), (2,0.4)], [(0,0.5), (1,0.9), (2,0.3)], [(0,0.3), (1,0.8), (2,0.4)], [(0,0.8), (1,0.2), (2,0.2)], [(0,0.1), (1,0.7), (2,0.5)] ]
[out]:
[[[(0,0.7), (1,0.2), (2,0.4)], [(0,0.8), (1,0.2), (2,0.2)]] , [[(0,0.5), (1,0.9), (2,0.3)], [(0,0.1), (1,0.7), (2,0.5)], [(0,0.3), (1,0.8), (2,0.4)]] , []]
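To make the intended rule concrete (judging from this second example): each document goes into the cluster of its highest-probability label, and a cluster may end up empty. A minimal pure-Python sketch of that grouping; the helper name group_by_best_label is just for illustration, and documents are kept in input order within each cluster:

N_TYPES = 3

def group_by_best_label(docs):
    # One bucket per label; a bucket may stay empty, as in the second example.
    clusters = [[] for _ in range(N_TYPES)]
    for doc in docs:
        # Pick the (label, probability) tuple with the highest probability,
        # then file the whole document under that label.
        best_label = max(doc, key=lambda pair: pair[1])[0]
        clusters[best_label].append(doc)
    return clusters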
Note: by the time the data reaches my part of the pipeline, I no longer have access to the original text.

How can I cluster a list of tuples of labels and probabilities? Is there anything in numpy, scipy, sklearn, or any other Python machine learning suite that can do this? Would even NLTK work?

Assume that the number of clusters is fixed, but not the cluster sizes.

I have only tried finding the maximum for the centroids, but that only gives me the top element of each cluster:
instream = [ [(0,0.3), (1,0.5), (2,0.1)], [(0,0.5), (1,0.3), (2,0.3)],
             [(0,0.4), (1,0.4), (2,0.5)], [(0,0.3), (1,0.7), (2,0.2)],
             [(0,0.2), (1,0.6), (2,0.1)] ]

# Find the centroid value for each label: the highest (label, probability)
# tuple, which for a fixed label means the highest probability.
c1_centroid_value = sorted([i[0] for i in instream], reverse=True)[0]
c2_centroid_value = sorted([i[1] for i in instream], reverse=True)[0]
c3_centroid_value = sorted([i[2] for i in instream], reverse=True)[0]

# Find the index of the document holding each centroid value.
c1_centroid = [i for i, j in enumerate(instream) if j[0] == c1_centroid_value][0]
c2_centroid = [i for i, j in enumerate(instream) if j[1] == c2_centroid_value][0]
c3_centroid = [i for i, j in enumerate(instream) if j[2] == c3_centroid_value][0]

print(instream[c1_centroid])
print(instream[c2_centroid])
print(instream[c3_centroid])
[out] (the top element of each cluster):

[(0, 0.5), (1, 0.3), (2, 0.3)]
[(0, 0.3), (1, 0.7), (2, 0.2)]
[(0, 0.4), (1, 0.4), (2, 0.5)]
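As an aside, the same three centroid lookups can be written more compactly with max and a key function; this is just a sketch of the identical logic over instream:

# instream[i][k] is the (label, probability) tuple for label k of document i;
# [1] selects the probability, so max over the document indices yields
# the index of the document with the highest probability for that label.
c1_centroid = max(range(len(instream)), key=lambda i: instream[i][0][1])
c2_centroid = max(range(len(instream)), key=lambda i: instream[i][1][1])
c3_centroid = max(range(len(instream)), key=lambda i: instream[i][2][1])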
Answer:

If I understand you correctly, this is what you want.
import numpy as np

N_TYPES = 3

instream = [ [(0,0.3), (1,0.5), (2,0.1)], [(0,0.5), (1,0.3), (2,0.3)],
             [(0,0.4), (1,0.4), (2,0.5)], [(0,0.3), (1,0.7), (2,0.2)],
             [(0,0.2), (1,0.6), (2,0.1)] ]
instream = np.array(instream)

# Strip the document labels, since only the probabilities matter here.
values = [[x[1] for x in doc] for doc in instream]

# Assign each document to a cluster via its maximum probability.
belongs_to = np.array([np.argmax(v) for v in values])

# Build, for each cluster, the array of indices into instream.
cluster_indices = [(belongs_to == k).nonzero()[0] for k in range(N_TYPES)]

# Apply the indices to get the full output.
out = [instream[cluster_indices[k]].tolist() for k in range(N_TYPES)]
The output out is:

[[[[0.0, 0.5], [1.0, 0.3], [2.0, 0.3]]],
 [[[0.0, 0.3], [1.0, 0.5], [2.0, 0.1]], [[0.0, 0.3], [1.0, 0.7], [2.0, 0.2]], [[0.0, 0.2], [1.0, 0.6], [2.0, 0.1]]],
 [[[0.0, 0.4], [1.0, 0.4], [2.0, 0.5]]]]
I used numpy arrays because they allow convenient searching and indexing. For example, the expression (belongs_to == 1).nonzero()[0] returns the array of indices into belongs_to whose value is 1, and instream[cluster_indices[2]] is an example of indexing with such an array. Note that round-tripping through np.array and tolist() is also why the (label, probability) pairs come back as lists of floats, e.g. [0.0, 0.5] instead of (0, 0.5).
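A quick demonstration of those two operations, using the belongs_to values that the example data produces (the per-document argmax gives 1, 0, 2, 1, 1):

import numpy as np

belongs_to = np.array([1, 0, 2, 1, 1])

# Indices of all documents assigned to cluster 1: prints [0 3 4]
print((belongs_to == 1).nonzero()[0])

# Build the index arrays for all three clusters, then pick out cluster 2.
cluster_indices = [(belongs_to == k).nonzero()[0] for k in range(3)]
print(cluster_indices[2])  # prints [2]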