I'm running into a problem writing a k-nearest-neighbors classifier: the array operations are taking far too long.
    import math
    import pdb

    def classify(k, train_data, target):
        num_rows = train_data.shape[0]
        num_cols = train_data.shape[1]
        distances = []
        candidates = [0] * 10
        for i, row in enumerate(train_data):
            dist = euclidean_dist(target[:num_cols - 1], row[:num_cols - 1])  # slow
            distances.append((dist, row[num_cols - 1]))
        distances.sort(key=lambda tup: tup[0])
        distances = distances[:k]
        for i, d in enumerate(distances):
            candidates[d[1]] += 1
        return candidates.index(max(candidates))

    def euclidean_dist(x1, x2):
        assert len(x1) == len(x2)
        result = 0
        pdb.set_trace()
        for i in range(len(x1)):  # the culprit; x1 and x2 are both lists of length 256
            result += math.pow(x1[i] - x2[i], 2)
        result = math.sqrt(result)
        return result
I've marked in the code above where the problem occurs; any suggestions for speeding it up are welcome.
Answer:
It looks like you just need the Euclidean distance (2-norm), which you can compute efficiently through NumPy (imported as np):
    import numpy as np

    def euclidean_dist2(x1, x2):
        assert len(x1) == len(x2)
        x1 = np.array(x1)
        x2 = np.array(x2)
        norm = np.linalg.norm(x1 - x2)
        return norm

    print(euclidean_dist2([1, 2], [4, 7]))
This gives you 5.83095189485, the same result as your original function, but vectorized.
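To see how much the vectorized version helps on inputs the size of yours (length-256 lists), a quick `timeit` comparison is worth running; this is just a sketch, and the names here are illustrative:

```python
import math
import timeit

import numpy as np

def euclidean_dist_loop(x1, x2):
    # the original pure-Python loop, for comparison
    result = 0
    for i in range(len(x1)):
        result += math.pow(x1[i] - x2[i], 2)
    return math.sqrt(result)

x1 = list(range(256))
x2 = list(range(256, 512))
# convert to arrays once, outside the timed call
a1, a2 = np.array(x1), np.array(x2)

loop_t = timeit.timeit(lambda: euclidean_dist_loop(x1, x2), number=1000)
vec_t = timeit.timeit(lambda: np.linalg.norm(a1 - a2), number=1000)
print(loop_t, vec_t)
```

One caveat: converting the lists to arrays inside the distance function on every call eats into the gain, so do the conversion once up front and pass arrays around.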
Breaking it down, you are just taking the element-wise difference, multiplying the resulting vector by itself (squaring it), summing, and finally taking the square root of the sum:
    def euclidean_dist3(x1, x2):
        assert len(x1) == len(x2)
        x1 = np.array(x1)
        x2 = np.array(x2)
        diff = x1 - x2
        squared = np.transpose(diff) * diff
        summed = sum(squared)
        norm = np.sqrt(summed)
        return norm
In other words, you are just computing the dot product of the difference vector with itself:
    def euclidean_dist4(x1, x2):
        assert len(x1) == len(x2)
        x1 = np.array(x1)
        x2 = np.array(x2)
        diff = x1 - x2
        dot = np.dot(diff, diff)
        norm = np.sqrt(dot)
        return norm
Different routes to the same result.
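Taking this one step further, the per-row Python loop in `classify` can itself be vectorized with broadcasting: subtract the target from every training row at once and take the norm along axis 1. Below is a sketch, assuming `train_data` is a 2-D NumPy array whose last column holds integer labels 0-9 (as your `candidates = [0] * 10` implies); `classify_vectorized` is a hypothetical name, not from the original code:

```python
import numpy as np

def classify_vectorized(k, train_data, target):
    # features are all columns except the last; the last column is the label
    features = train_data[:, :-1]
    labels = train_data[:, -1].astype(int)
    # one broadcasted subtraction + norm replaces the per-row Python loop
    dists = np.linalg.norm(features - target[:-1], axis=1)
    # indices of the k smallest distances; argpartition avoids a full sort
    if k < len(dists):
        nearest = np.argpartition(dists, k)[:k]
    else:
        nearest = np.arange(len(dists))
    # vote: count label occurrences among the k nearest, pick the most common
    votes = np.bincount(labels[nearest], minlength=10)
    return int(np.argmax(votes))
```

`np.argpartition` is O(n) rather than the O(n log n) of a full sort, which matters when the training set is large and k is small.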