How to cluster a list of tuples of labels and probabilities? – python

I have a bunch of texts that have been classified into different categories, and each document is then tagged 0, 1 or 2 with a probability assigned to each tag.

[ "this is a foo bar",
  "bar bar black sheep",
  "sheep is an animal",
  "foo foo bar bar",
  "bar bar sheep sheep" ]

The previous tool in the pipeline returns a list of lists of tuples, where each element of the outer list corresponds to one document. All I have to work with is the fact that I know each document is tagged 0, 1 or 2 together with the corresponding probabilities, like this:

[ [(0,0.3), (1,0.5), (2,0.1)],
  [(0,0.5), (1,0.3), (2,0.3)],
  [(0,0.4), (1,0.4), (2,0.5)],
  [(0,0.3), (1,0.7), (2,0.2)],
  [(0,0.2), (1,0.6), (2,0.1)] ]

I need to look at which tag has the highest probability in each list of tuples, and achieve the following result:

[ [[(0,0.5), (1,0.3), (2,0.3)], [(0,0.4), (1,0.4), (2,0.5)]] ,
  [[(0,0.3), (1,0.7), (2,0.2)], [(0,0.2), (1,0.6), (2,0.1)]] ,
  [[(0,0.4), (1,0.4), (2,0.5)]] ]
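The per-document selection step can be sketched with `max` and a key function on the probability (a minimal sketch; note that with strict argmax, a tie goes to the first tag, so this does not reproduce the duplicated document in the example above):

```python
# Pick the (tag, probability) pair with the highest probability
# for each document, using max() with a key on the probability.
docs = [
    [(0, 0.3), (1, 0.5), (2, 0.1)],
    [(0, 0.5), (1, 0.3), (2, 0.3)],
]

winners = [max(doc, key=lambda pair: pair[1]) for doc in docs]
print(winners)  # [(1, 0.5), (0, 0.5)]
```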

Another example:

[in]:

[ [(0,0.7), (1,0.2), (2,0.4)],
  [(0,0.5), (1,0.9), (2,0.3)],
  [(0,0.3), (1,0.8), (2,0.4)],
  [(0,0.8), (1,0.2), (2,0.2)],
  [(0,0.1), (1,0.7), (2,0.5)] ]

[out]:

[ [[(0,0.7), (1,0.2), (2,0.4)], [(0,0.8), (1,0.2), (2,0.2)]] ,
  [[(0,0.5), (1,0.9), (2,0.3)], [(0,0.1), (1,0.7), (2,0.5)], [(0,0.3), (1,0.8), (2,0.4)]] ,
  [] ]

Note: By the time the data reaches my part of the pipeline, I no longer have access to the original texts.

How can I cluster a list of tuples of labels and probabilities? Is there anything in numpy, scipy, sklearn or any other Python machine learning suite that can do this? Maybe even NLTK?

Assume that the number of clusters is fixed, but the cluster sizes are not.
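With a fixed number of clusters, one plain-Python way to keep empty clusters in the output (like the trailing `[]` in the second example) is to pre-allocate one bucket per tag. This is only a sketch, assuming a strict argmax with ties going to the lowest tag:

```python
N_CLUSTERS = 3

instream = [
    [(0, 0.7), (1, 0.2), (2, 0.4)],
    [(0, 0.5), (1, 0.9), (2, 0.3)],
    [(0, 0.8), (1, 0.2), (2, 0.2)],
]

# Pre-allocate one empty bucket per tag, so clusters that receive
# no documents still show up as [] in the result.
clusters = [[] for _ in range(N_CLUSTERS)]
for doc in instream:
    best_label = max(doc, key=lambda pair: pair[1])[0]
    clusters[best_label].append(doc)

print(clusters[2])  # [] -- no document peaks at tag 2
```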

I have only tried finding the maximum value for each centroid, but that only gives me the first value in each cluster:

instream = [ [(0,0.3), (1,0.5), (2,0.1)],
             [(0,0.5), (1,0.3), (2,0.3)],
             [(0,0.4), (1,0.4), (2,0.5)],
             [(0,0.3), (1,0.7), (2,0.2)],
             [(0,0.2), (1,0.6), (2,0.1)] ]

# Find the centroids.
c1_centroid_value = sorted([i[0] for i in instream], reverse=True)[0]
c2_centroid_value = sorted([i[1] for i in instream], reverse=True)[0]
c3_centroid_value = sorted([i[2] for i in instream], reverse=True)[0]

c1_centroid = [i for i,j in enumerate(instream) if j[0] == c1_centroid_value][0]
c2_centroid = [i for i,j in enumerate(instream) if j[1] == c2_centroid_value][0]
c3_centroid = [i for i,j in enumerate(instream) if j[2] == c3_centroid_value][0]

print(instream[c1_centroid])
print(instream[c2_centroid])
print(instream[c2_centroid])  # note: c2_centroid repeated, which produces the duplicated line below

[out] (the top element of each cluster):

[(0, 0.5), (1, 0.3), (2, 0.3)]
[(0, 0.3), (1, 0.7), (2, 0.2)]
[(0, 0.3), (1, 0.7), (2, 0.2)]

Answer:

If I understood you correctly, this is what you want.

import numpy as np

N_TYPES = 3

instream = [ [(0,0.3), (1,0.5), (2,0.1)],
             [(0,0.5), (1,0.3), (2,0.3)],
             [(0,0.4), (1,0.4), (2,0.5)],
             [(0,0.3), (1,0.7), (2,0.2)],
             [(0,0.2), (1,0.6), (2,0.1)] ]
instream = np.array(instream)

# Drop the document tags, since we only care about the probabilities.
values = [[prob for _, prob in doc] for doc in instream]

# Assign each document to a cluster via its maximum probability.
belongs_to = np.array([np.argmax(doc) for doc in values])

# Build per-cluster index arrays pointing into instream.
cluster_indices = [(belongs_to == k).nonzero()[0] for k in range(N_TYPES)]

# Apply the indices to get the full output.
out = [instream[cluster_indices[k]].tolist() for k in range(N_TYPES)]

Output `out`:

[[[[0.0, 0.5], [1.0, 0.3], [2.0, 0.3]]],
 [[[0.0, 0.3], [1.0, 0.5], [2.0, 0.1]],
  [[0.0, 0.3], [1.0, 0.7], [2.0, 0.2]],
  [[0.0, 0.2], [1.0, 0.6], [2.0, 0.1]]],
 [[[0.0, 0.4], [1.0, 0.4], [2.0, 0.5]]]]

I used numpy arrays because they support nice searching and indexing. For example, the expression (belongs_to == 1).nonzero()[0] returns the array of indices into belongs_to where the value is 1. An example of such indexing is instream[cluster_indices[2]].
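The two numpy idioms used above can be seen in isolation in a small sketch (hypothetical toy arrays, not the question's data):

```python
import numpy as np

# Boolean mask + nonzero(): find the positions of a given value.
belongs_to = np.array([1, 0, 2, 1, 1])

# Indices of every document assigned to cluster 1.
idx = (belongs_to == 1).nonzero()[0]
print(idx.tolist())  # [0, 3, 4]

# Fancy indexing: pull those positions out of another array in one step.
data = np.array([10, 20, 30, 40, 50])
print(data[idx].tolist())  # [10, 40, 50]
```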

