
Clustering Lists of Words

IT Technology · xiaolong · May 22, 2025

Suppose I have a collection of word lists, for example:

[['apple','banana'], ['apple','orange'], ['banana','orange'], ['rice','potatoes','orange'], ['potatoes','rice']]

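The real collection is much larger. I want to group words that usually appear together into the same cluster. In this example, the clusters would be ['apple', 'banana', 'orange'] and ['rice','potatoes'].
What is the best way to do this kind of clustering?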

Answer:

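After a lot of searching, I found that I cannot really use standard clustering techniques here, because I have no feature variables on which to cluster the words. If I build a table recording how often each word appears together with every other word (effectively the Cartesian product of the vocabulary with itself), what I get is an adjacency matrix, and clustering does not work well on that.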
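To make the co-occurrence table concrete, here is a minimal sketch using only the standard library. The function name `cooccurrence_counts` is hypothetical, and counting each unordered pair once per list is an assumption; the answer does not say exactly how the table was built.

```python
from collections import Counter
from itertools import combinations

word_lists = [
    ['apple', 'banana'],
    ['apple', 'orange'],
    ['banana', 'orange'],
    ['rice', 'potatoes', 'orange'],
    ['potatoes', 'rice'],
]

def cooccurrence_counts(lists):
    """Count how many lists contain each unordered pair of distinct words."""
    counts = Counter()
    for words in lists:
        # sorted(set(...)) drops duplicates within a list and gives every
        # pair a canonical (alphabetical) order, so ('a', 'b') == ('b', 'a').
        for a, b in combinations(sorted(set(words)), 2):
            counts[(a, b)] += 1
    return counts

cooc = cooccurrence_counts(word_lists)
for (a, b), n in sorted(cooc.items()):
    print(a, b, n)
```

Each non-zero entry is one weighted edge of the word graph, which is why the table can be read as an adjacency matrix.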

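So the solution I was looking for is graph community detection. I used the igraph library (through its Python wrapper, python-igraph) to find the clusters, and it works very well and runs fast.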

More information:

A similar question: https://stats.stackexchange.com/questions/142297/finding-natural-groups-clusters-in-an-undirected-graph-over-several-undirect
A paper on community detection in graphs: https://arxiv.org/pdf/0906.0612.pdf
A basic description of the various algorithms: What are the differences between community detection algorithms in igraph?

cluster-analysis information-retrieval machine-learning python
