Scikit-learn KMeans cost (inertia) is inaccurate

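I want to get the k-means cost (called inertia in scikit-learn's KMeans). As a reminder:

The cost is the sum of squared distances from each point to its nearest cluster center.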
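As a minimal sketch of this definition, the cost can be written in a few lines of NumPy. The helper name `kmeans_cost` and the toy data are illustrative assumptions, not part of scikit-learn or of the original question.

```python
import numpy as np


def kmeans_cost(points, centers):
    """Sum of squared distances from each point to its nearest center."""
    # diffs[i, j] is the vector from center j to point i.
    diffs = points[:, None, :] - centers[None, :, :]
    # sq_dists[i, j] is the squared Euclidean distance between point i and center j.
    sq_dists = (diffs ** 2).sum(axis=2)
    # Keep only the nearest center for each point, then sum over all points.
    return sq_dists.min(axis=1).sum()


points = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0]])
centers = np.array([[0.5, 0.0], [10.0, 0.0]])
cost = kmeans_cost(points, centers)
print(cost)  # 0.25 + 0.25 + 0.0 = 0.5
```

The broadcasting builds an array of shape (n_points, n_centers, n_features), so for very large inputs it may be worth processing the points in chunks.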

I found a strange discrepancy between scikit-learn's cost ("inertia") and my own straightforward way of computing the cost.

See the following example:

    import numpy as np
    from sklearn.cluster import KMeans

    p = np.random.rand(1000000, 2)
    a = KMeans(n_clusters=3).fit(p)
    print(a.inertia_, "****")

    means = a.cluster_centers_
    s = 0
    for x in p:
        # squared distance from x to its nearest center
        best = float("inf")
        for y in means:
            d = np.linalg.norm(x - y) ** 2
            if d < best:
                best = d
        s += best
    print(s, "*****")

In my run, the output is:

    66178.4232156 ****
    66173.7928716 *****

On my own dataset the discrepancy is much larger (a 20% difference).
Is this a bug in the scikit-learn implementation?

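Answer:

First of all, this does not seem to be a bug (although there is certainly an ugly inconsistency). Why is that? You need to take a closer look at what the code is actually doing. For this general-purpose case it calls Cython code from _k_means.pyx

(lines 577-578)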

    inertia = _k_means._assign_labels_array(
        X, x_squared_norms, centers, labels, distances=distances)

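What it does is essentially the same as your code, but it uses doubles in C. So maybe this is just a numerical issue? Let's test your code again, this time with a clear cluster structure, so that there are no points that could be assigned to more than one center depending on numerical precision.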

    import numpy as np
    from sklearn.cluster import KMeans

    p = np.random.rand(1000000, 2)
    p[:p.shape[0] // 2, :] += 100  # move half of the points far away
    a = KMeans(n_clusters=2).fit(p)  # changed to two clusters
    print(a.inertia_, "****")

    means = a.cluster_centers_
    s = 0
    for x in p:
        # squared distance from x to its nearest center
        best = float("inf")
        for y in means:
            d = (x - y).T.dot(x - y)
            if d < best:
                best = d
        s += best
    print(s, "*****")

Results:

    166805.190832 ****
    166805.190946 *****

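So this makes sense. The problem is the existence of samples "close to the boundary", which may be assigned to different clusters depending on arithmetic precision. Unfortunately, I was not able to trace exactly where the difference comes from.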
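One plausible source of such rounding differences, offered here only as an assumption: the `x_squared_norms` argument in the call above suggests that squared distances are computed from the expanded form ||x||² - 2x·c + ||c||² rather than from (x - c)·(x - c). The two are mathematically identical but round differently, especially for points far from the origin or for single-precision data. The following sketch compares them using only NumPy; the helper `max_formula_gap` and the data are hypothetical.

```python
import numpy as np


def max_formula_gap(dtype):
    """Largest gap between direct and expanded squared distances for one dtype."""
    rng = np.random.default_rng(0)
    # Points far from the origin, where cancellation in the expanded form is worst.
    x = (rng.random((1000, 2)) + 100.0).astype(dtype)
    c = np.array([100.5, 100.5], dtype=dtype)

    # Direct form: subtract first, then square.
    direct = ((x - c) ** 2).sum(axis=1)
    # Expanded form: ||x||^2 - 2 x.c + ||c||^2, which subtracts large, nearly equal numbers.
    expanded = (x ** 2).sum(axis=1) - 2.0 * x @ c + (c ** 2).sum()

    return float(np.abs(direct - expanded).max())


gap64 = max_formula_gap(np.float64)
gap32 = max_formula_gap(np.float32)
print("float64 gap:", gap64)
print("float32 gap:", gap32)
```

A gap of this kind is tiny for any single point, but it can be enough to flip the nearest center for a point lying almost exactly on a cluster boundary.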

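The funny thing is that there really is an inconsistency here, because the inertia_ field is populated by the Cython code, while .score calls the NumPy code. So if you call

    print(-a.score(p))

you will get exactly your inertia.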
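To check this on your own installation, a small sketch along the following lines prints all three values side by side. The dataset size, `random_state` and `n_init` values are arbitrary choices for illustration, and how closely the numbers agree may depend on your scikit-learn version.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
p = rng.random((10000, 2))

model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(p)

# Recompute the cost directly in NumPy from the fitted centers.
diffs = p[:, None, :] - model.cluster_centers_[None, :, :]
manual = (diffs ** 2).sum(axis=2).min(axis=1).sum()

inertia = model.inertia_   # value stored during fit
score = -model.score(p)    # value recomputed by score()

print("inertia_ :", inertia)
print("-score(p):", score)
print("manual   :", manual)
```

If the three numbers differ noticeably on your data, comparing `model.predict(p)` with the nearest-center labels computed from `diffs` is one way to see whether borderline points are being assigned differently.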
