I want to build a grid from sampled data. I could use a machine-learning clustering algorithm such as k-means, but I want to constrain the centers to be roughly evenly spaced.
I've come up with an approach using scikit-learn's nearest-neighbors search: pick a point at random, delete all points within radius r of it, and repeat. This works fine, but I'm wondering whether anyone has a better (faster) way of doing it.
In response to the comments I tried two alternatives, one of which turned out to be much slower and the other roughly the same speed…
Method 0 (my first attempt):
import numpy as np
import sklearn.neighbors as near
from random import random

def get_centers0(X, r):
    N = X.shape[0]
    D = X.shape[1]
    grid = np.zeros([0, D])
    nearest = near.NearestNeighbors(radius = r, algorithm = 'auto')
    while N > 0:
        nearest.fit(X)
        # pick one remaining point at random (kept 2-D for radius_neighbors)
        x = X[int(random() * N), :].reshape(1, -1)
        _, del_x = nearest.radius_neighbors(x)
        # remove the picked point and everything within radius r of it
        X = np.delete(X, del_x[0], axis = 0)
        grid = np.vstack([grid, x])
        N = X.shape[0]
    return grid
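As a point of comparison (my own sketch, not one of the original attempts), the same sample-and-prune loop can be written against scipy's cKDTree, which builds the tree once and masks out pruned points instead of re-fitting a neighbors model on every iteration. The function name `get_centers_kdtree` and the `rng` argument are mine:

```python
import numpy as np
from scipy.spatial import cKDTree

def get_centers_kdtree(X, r, rng=None):
    # Sample-and-prune: pick a random remaining point, make it a center,
    # drop every point within radius r of it, repeat until nothing is left.
    rng = np.random.default_rng(rng)
    tree = cKDTree(X)                        # built once, never re-fit
    remaining = np.ones(len(X), dtype=bool)  # mask instead of np.delete
    idx = np.arange(len(X))
    centers = []
    while remaining.any():
        i = rng.choice(idx[remaining])
        centers.append(X[i])
        for j in tree.query_ball_point(X[i], r):
            remaining[j] = False             # prunes the center itself too
    return np.asarray(centers)
```

Because the tree is never rebuilt, each iteration costs only one radius query plus a mask update, rather than a full refit of the shrinking dataset.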
Method 1 (using a precomputed graph):
def get_centers1(X, r):
    N = X.shape[0]
    D = X.shape[1]
    grid = np.zeros([0, D])
    nearest = near.NearestNeighbors(radius = r, algorithm = 'auto')
    nearest.fit(X)
    graph = nearest.radius_neighbors_graph(X)
    # This is already very slow before any "pruning" even starts
Method 2:
def get_centers2(X, r, k):
    N = X.shape[0]
    D = X.shape[1]
    grid = np.zeros([0, D])
    nearest = near.NearestNeighbors(radius = r, algorithm = 'auto')
    while N > 0:
        nearest.fit(X)
        # sample k candidate points at once
        x = X[np.random.randint(0, N, k), :]
        #min_dist = near.NearestNeighbors().fit(x).kneighbors(x, n_neighbors = 1, return_distance = True)
        min_dist = dist(x, k, 2, np.ones(k))  # where dist is a function compiled in cython
        x = x[min_dist < 0.1, :]
        _, del_x = nearest.radius_neighbors(x)
        X = np.delete(X, del_x[0], axis = 0)
        grid = np.vstack([grid, x])
        N = X.shape[0]
    return grid
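The compiled `dist` helper is not shown. Judging from the commented-out sklearn line, it appears to return each sampled point's distance to its nearest other sample. A hypothetical pure-NumPy stand-in with a signature guessed from the call site `dist(x, k, 2, np.ones(k))` might look like this (the `w` weights argument is unused in this sketch):

```python
import numpy as np

def dist(x, k, p, w):
    # Hypothetical stand-in for the compiled cython `dist` helper:
    # distance from each of the k sampled points to its nearest other sample,
    # using the Minkowski order-p metric. `w` is accepted but ignored here.
    diff = np.abs(x[:, None, :] - x[None, :, :]) ** p
    d = diff.sum(axis=2) ** (1.0 / p)
    np.fill_diagonal(d, np.inf)  # ignore each point's zero self-distance
    return d.min(axis=1)
```

This is only a guess at the original semantics, but it makes the snippet above runnable without the cython module.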
I run the methods as follows:
import time

N = 50000
r = 0.1
x1 = np.random.rand(N)
x2 = np.random.rand(N)
X = np.vstack([x1, x2]).T

tic = time.time()
grid0 = get_centers0(X, r)
toc = time.time()
print 'Method 0: ' + str(toc - tic)

tic = time.time()
get_centers1(X, r)
toc = time.time()
print 'Method 1: ' + str(toc - tic)

tic = time.time()
grid2 = get_centers2(X, r, 10)
toc = time.time()
print 'Method 2: ' + str(toc - tic)
Methods 0 and 2 run in roughly the same time…
Method 0: 0.840130090714
Method 1: 2.23365592957
Method 2: 0.774812936783
Answer:
I've come up with a very simple approach that is far more efficient than my previous attempts.
It simply walks through the dataset and adds a point to the list of grid points only if that point is farther than r from every existing center. It runs about 20 times faster than my earlier attempts, and since no external libraries are involved I can run it all in cython…
import numpy as np
cimport numpy as np
cimport cython

DTYPE = np.float64
ctypedef np.float64_t DTYPE_t

@cython.boundscheck(False)
@cython.wraparound(False)
@cython.nonecheck(False)
def get_centers_fast(np.ndarray[DTYPE_t, ndim = 2] x, double radius):
    cdef int N = x.shape[0]
    cdef int D = x.shape[1]
    cdef int m = 1
    # room for up to 10000 centers
    cdef np.ndarray[DTYPE_t, ndim = 2] xc = np.zeros([10000, D])
    cdef double r = 0
    cdef double r_min = 10
    cdef int i, j, k

    # the first data point is always a center
    for k in range(D):
        xc[0, k] = x[0, k]

    for i in range(1, N):
        # distance from x[i] to its nearest accepted center
        r_min = 10
        for j in range(m):
            r = 0
            for k in range(D):
                r += (x[i, k] - xc[j, k])**2
            r = r**0.5
            if r < r_min:
                r_min = r
        # accept x[i] as a new center if no existing center is within radius
        if r_min > radius:
            m = m + 1
            for k in range(D):
                xc[m - 1, k] = x[i, k]

    # the first m rows hold the accepted centers
    return xc[:m, :]
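For reference, the same greedy pass can be sketched in plain NumPy without compiling anything (the name `get_centers_py` is mine, not from the original code). It will be much slower than the cython version, but is handy for checking the logic:

```python
import numpy as np

def get_centers_py(X, radius):
    # Greedy pass: keep a point only if it is farther than `radius`
    # from every center accepted so far.
    centers = [X[0]]
    for p in X[1:]:
        d = np.linalg.norm(np.asarray(centers) - p, axis=1)
        if d.min() > radius:
            centers.append(p)
    return np.asarray(centers)
```
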
I run the methods as follows:
N = 40000
r = 0.1
x1 = np.random.normal(size = N)
x1 = (x1 - min(x1)) / (max(x1) - min(x1))
x2 = np.random.normal(size = N)
x2 = (x2 - min(x2)) / (max(x2) - min(x2))
X = np.vstack([x1, x2]).T

# gt is the compiled cython module containing the functions above
tic = time.time()
grid0 = gt.get_centers0(X, r)
toc = time.time()
print 'Method 0: ' + str(toc - tic)

tic = time.time()
grid2 = gt.get_centers2(X, r, 10)
toc = time.time()
print 'Method 2: ' + str(toc - tic)

tic = time.time()
grid3 = gt.get_centers_fast(X, r)
toc = time.time()
print 'Method 3: ' + str(toc - tic)
The new approach is roughly 20x faster. It could probably be made faster still by terminating the loop early, e.g. when k consecutive iterations fail to produce a new center.
Method 0: 0.219595909119
Method 2: 0.191949129105
Method 3: 0.0127329826355
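The early-termination idea mentioned above could look something like this (a rough sketch in plain Python/NumPy rather than cython; the `max_misses` parameter and function name are my own):

```python
import numpy as np

def get_centers_early_stop(X, radius, max_misses=5000):
    # Same greedy pass, but give up after `max_misses` consecutive
    # points fail to yield a new center.
    centers = [X[0]]
    misses = 0
    for p in X[1:]:
        d = np.linalg.norm(np.asarray(centers) - p, axis=1)
        if d.min() > radius:
            centers.append(p)
            misses = 0          # reset the streak on every accepted center
        else:
            misses += 1
            if misses >= max_misses:
                break           # assume the space is already covered
    return np.asarray(centers)
```

Note this trades accuracy for speed: with a small `max_misses` the loop may stop before sparse regions of the data have been reached.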