I tried to follow this article on implementing Gaussian Mixture Models and maximum-likelihood optimization with NumPy in order to implement the EM algorithm for a GMM, but without success. My implementation is as follows:
```python
import numpy as np

# eps is referenced below but its definition was lost from the snippet;
# a small constant like this is assumed, to avoid division by zero
eps = 1e-8

def PDF(data, means, variances):
    return 1/(np.sqrt(2 * np.pi * variances) + eps) * np.exp(-1/2 * (np.square(data - means) / (variances + eps)))

def EM_GMM(data, k, iterations):
    weights = np.ones((k, 1)) / k                                # shape=(k, 1)
    means = np.random.choice(data, k)[:, np.newaxis]             # shape=(k, 1)
    variances = np.random.random_sample(size=k)[:, np.newaxis]   # shape=(k, 1)

    data = np.repeat(data[np.newaxis, :], k, 0)                  # shape=(k, n)

    for step in range(iterations):
        # Expectation step
        likelihood = PDF(data, means, np.sqrt(variances))        # shape=(k, n)

        # Maximization step
        b = likelihood * weights                                 # shape=(k, n)
        b /= np.sum(b, axis=1)[:, np.newaxis] + eps

        # update means, variances, and weights
        means = np.sum(b * data, axis=1)[:, np.newaxis] / (np.sum(b, axis=1)[:, np.newaxis] + eps)
        variances = np.sum(b * np.square(data - means), axis=1)[:, np.newaxis] / (np.sum(b, axis=1)[:, np.newaxis] + eps)
        weights = np.mean(b, axis=1)[:, np.newaxis]

    return means, variances
```
When I run the algorithm on a one-dimensional time-series dataset with k set to 3, I get the following output:
```
array([[0.00000000e+000, 0.00000000e+000, 0.00000000e+000, 3.05053810e-003, 2.36989898e-025,
        2.36989898e-025, 1.32797395e-136, 6.91134950e-031, 5.47347807e-001, 1.44637007e+000,
        1.44637007e+000, 1.44637007e+000, 1.44637007e+000, 1.44637007e+000, 1.44637007e+000,
        1.44637007e+000, 1.44637007e+000, 1.44637007e+000, 1.44637007e+000, 1.44637007e+000,
        1.44637007e+000, 1.44637007e+000, 2.25849208e-064, 0.00000000e+000, 1.61228562e-303,
        0.00000000e+000, 0.00000000e+000, 0.00000000e+000, 0.00000000e+000, 3.94387272e-242,
        1.13078186e+000, 2.53108878e-001, 5.33548114e-001, 9.14920432e-001, 2.07015697e-013,
        4.45250680e-038, 1.43000602e+000, 1.28781615e+000, 1.44821615e+000, 1.18186109e+000,
        3.21610659e-002, 3.21610659e-002, 3.21610659e-002, 3.21610659e-002, 3.21610659e-002,
        2.47382844e-039, 0.00000000e+000, 2.09150855e-200, 0.00000000e+000, 0.00000000e+000],
       [5.93203066e-002, 1.01647068e+000, 5.99299162e-001, 0.00000000e+000, 0.00000000e+000,
        0.00000000e+000, 0.00000000e+000, 0.00000000e+000, 0.00000000e+000, 0.00000000e+000,
        0.00000000e+000, 0.00000000e+000, 0.00000000e+000, 0.00000000e+000, 0.00000000e+000,
        0.00000000e+000, 0.00000000e+000, 0.00000000e+000, 0.00000000e+000, 0.00000000e+000,
        0.00000000e+000, 0.00000000e+000, 0.00000000e+000, 2.14690238e-010, 2.49337135e-191,
        5.10499986e-001, 9.32658804e-001, 1.21148135e+000, 1.13315278e+000, 2.50324069e-237,
        0.00000000e+000, 0.00000000e+000, 0.00000000e+000, 0.00000000e+000, 0.00000000e+000,
        0.00000000e+000, 0.00000000e+000, 0.00000000e+000, 0.00000000e+000, 0.00000000e+000,
        0.00000000e+000, 0.00000000e+000, 0.00000000e+000, 0.00000000e+000, 0.00000000e+000,
        0.00000000e+000, 1.73966953e-125, 2.53559290e-275, 1.42960975e-065, 7.57552338e-001],
       [0.00000000e+000, 0.00000000e+000, 0.00000000e+000, 3.05053810e-003, 2.36989898e-025,
        2.36989898e-025, 1.32797395e-136, 6.91134950e-031, 5.47347807e-001, 1.44637007e+000,
        1.44637007e+000, 1.44637007e+000, 1.44637007e+000, 1.44637007e+000, 1.44637007e+000,
        1.44637007e+000, 1.44637007e+000, 1.44637007e+000, 1.44637007e+000, 1.44637007e+000,
        1.44637007e+000, 1.44637007e+000, 2.25849208e-064, 0.00000000e+000, 1.61228562e-303,
        0.00000000e+000, 0.00000000e+000, 0.00000000e+000, 0.00000000e+000, 3.94387272e-242,
        1.13078186e+000, 2.53108878e-001, 5.33548114e-001, 9.14920432e-001, 2.07015697e-013,
        4.45250680e-038, 1.43000602e+000, 1.28781615e+000, 1.44821615e+000, 1.18186109e+000,
        3.21610659e-002, 3.21610659e-002, 3.21610659e-002, 3.21610659e-002, 3.21610659e-002,
        2.47382844e-039, 0.00000000e+000, 2.09150855e-200, 0.00000000e+000, 0.00000000e+000]])
```
I think this output is wrong, because the function should return two vectors, one representing the `means` values and the other the `variances` values. What makes me doubt the implementation is that most of the returned values are `0.00000000e+000`, which hardly seem worth visualizing. (Note that the input is time-series data.) I have checked and traced through the code several times but could not find any error.
Here is my input data:
```
[25.31, 24.31, 24.12, 43.46, 41.48666667, 41.48666667, 37.54, 41.175, 44.81,
 44.44571429, 44.44571429, 44.44571429, 44.44571429, 44.44571429, 44.44571429,
 44.44571429, 44.44571429, 44.44571429, 44.44571429, 44.44571429, 44.44571429,
 44.44571429, 39.71, 26.69, 34.15, 24.94, 24.75, 24.56, 24.38, 35.25, 44.62,
 44.94, 44.815, 44.69, 42.31, 40.81, 44.38, 44.56, 44.44, 44.25, 43.66666667,
 43.66666667, 43.66666667, 43.66666667, 43.66666667, 40.75, 32.31, 36.08,
 30.135, 24.19]
```
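The exact call is not shown in the question; for reference, a minimal hypothetical invocation of the function above on this series might look like the sketch below (the iteration count is an assumed value, and only the first few data points are spelled out):

```python
import numpy as np

# first few points of the series above; in practice use the full 50-point array
data = np.array([25.31, 24.31, 24.12, 43.46, 41.48666667, 41.48666667,
                 37.54, 41.175, 44.81])

# k=3 components; the number of iterations is an assumption
means, variances = EM_GMM(data, k=3, iterations=50)
print(means.ravel(), variances.ravel())
```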
I would like to know whether there is an elegant way to implement this with numpy or SciKit-learn. Any help would be appreciated.
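On the scikit-learn side, a minimal sketch using `sklearn.mixture.GaussianMixture` might look like this; scikit-learn expects a 2-D array, so the 1-D series has to be reshaped into a column vector (the `n_components=3` and `random_state=0` values are arbitrary choices for illustration, and only a few data points are shown):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# 1-D time series reshaped to shape (n_samples, 1); replace with the full series
X = np.array([25.31, 24.31, 24.12, 43.46, 41.48666667, 41.48666667,
              37.54, 41.175, 44.81]).reshape(-1, 1)

# 3-component GMM; init_params defaults to "kmeans" in scikit-learn
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)

print(gmm.weights_)              # mixture weights
print(gmm.means_.ravel())        # component means
print(gmm.covariances_.ravel())  # component variances (1-D data)
```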
Update: Below are the current output and the expected output:

...(outputs omitted)
Answer:
As I mentioned in the comments, I think the key point is the initialization of `means`. Following the default implementation of sklearn's Gaussian Mixture, I switched from random initialization to initializing the means with KMeans.
...(code omitted)
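Since the answer's code block is not reproduced above, here is a rough sketch of the idea only (not the author's exact code): the random `means` initialization inside `EM_GMM` could be replaced by KMeans cluster centers, for example:

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_init_means(data, k):
    """Seed the k component means with KMeans cluster centers (illustrative helper)."""
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data.reshape(-1, 1))
    return km.cluster_centers_  # shape=(k, 1), same shape EM_GMM uses for means

# inside EM_GMM, instead of:
#     means = np.random.choice(data, k)[:, np.newaxis]
# one would use:
#     means = kmeans_init_means(data, k)
```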
This approach seems to produce the desired output much more consistently:
...(code omitted)
Finally, we can see that purely random initialization produces different results; let's look at the resulting `means` values:
...(code omitted)
You can see how different these results are; in some cases the resulting means are constant, meaning that the initialization picked three similar values which did not change much while iterating. Adding some print statements inside the `EM_GMM` function makes this clearer.
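For instance, a small debug print such as the one below (a sketch; the format is arbitrary, and the variable names follow the `EM_GMM` snippet from the question) could be added at the end of each iteration:

```python
# at the end of the for-loop body in EM_GMM:
print(f"step {step:3d} | means={means.ravel()} | "
      f"variances={variances.ravel()} | weights={weights.ravel()}")
```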