核密度估计 Julia

我正在尝试实现一个核密度估计。然而，我的代码并未给出应有的答案。代码是用Julia编写的，但应该可以自解释。

这是算法：

$\$ f(x) = \frac{1}{n*h} * \sum_{i = 1}^n K(\frac{x - X_i}{h}) \$$

其中

$\$ K(u) = 0.5*I(|u| <= 1)\$ with \$ u = \frac{x - X_i}{h}\$$

因此，算法测试x与观测值X_i之间的距离是否小于1，该距离由某个常数因子（分箱宽度）加权。如果是，则为该值分配0.5 / (n * h)，其中n为观测值的数量。

这是我的实现：

#Kernel density function.#Purpose: estimate the probability density function (pdf)#of given observations#@param data: observations for which the pdf should be estimated#@return: returns an array with the estimated densities function kernelDensity(data)|   |   #Uniform kernel function. |   #@param x: Current x value|   #@param X_i: x value of observation i|   #@param width: binwidth|   #@return: Returns 1 if the absolute distance from|   #x(current) to x(observation) weighted by the binwidth|   #is less then 1. Else it returns 0.|  |   function uniformKernel(x, observation, width)|   |   u = ( x - observation ) / width|   |   abs ( u ) <= 1 ? 1 : 0|   end||   #number of observations in the data set |   n = length(data)||   #binwidth (set arbitraily to 0.1|   h = 0.1 |   |   #vector that stored the pdf|   res = zeros( Real, n )|   |   #counter variable for the loop |   counter = 0||   #lower and upper limit of the x axis|   start = floor(minimum(data))|   stop = ceil (maximum(data))||   #main loop|   #@linspace: divides the space from start to stop in n|   #equally spaced intervalls|   for x in linspace(start, stop, n) |   |   counter += 1|   |   for observation in data|   |   ||   |   |   #count all observations for which the kernel|   |   |   #returns 1 and mult by 0.5 because the|   |   |   #kernel computed the absolute difference which can be|   |   |   #either positive or negative|   |   |   res[counter] += 0.5 * uniformKernel(x, observation, h)|   |   end|   |   #devide by n times h|   |   res[counter] /= n * h|   end|   #return results|   resend#run function#@rand: generates 10 uniform random numbers between 0 and 1kernelDensity(rand(10))

返回的结果是：

> 0.0> 1.5> 2.5> 1.0> 1.5> 1.0> 0.0> 0.5> 0.5> 0.0

这些值的总和为：8.5（累积分布函数。应该为1。）

因此，有两个错误：

数值未正确缩放。每个数字应为当前值的大约十分之一。实际上，如果观测值的数量增加10^n（n = 1, 2, …），则累积分布函数也会增加10^n

例如：

> kernelDensity(rand(1000))> 953.53

它们加起来不是10（如果没有缩放错误的话，应该是1）。随着样本量的增加，错误变得更加明显：大约有5%的观测值未被包括在内。

我认为我是一比一地实现了公式，因此我真的不明白错误出在哪里。

回答：

我不是KDE的专家，所以请谨慎对待这些建议，但一个非常相似的（但速度更快！）代码实现将是：

function kernelDensity{T<:AbstractFloat}(data::Vector{T}, h::T)  res = similar(data)  lb = minimum(data); ub = maximum(data)  for (i,x) in enumerate(linspace(lb, ub, size(data,1)))    for obs in data      res[i] += abs((obs-x)/h) <= 1. ? 0.5 : 0.    end    res[i] /= (n*h) end sum(res)end

如果我没记错的话，密度估计应该积分为1，也就是说我们期望kernelDensity(rand(100), 0.1)/100至少接近1。在上面的实现中，我达到了这个目标，误差在5%以内，但我们不知道0.1是否是最佳带宽（使用h=0.135我达到了0.1%以内的误差），而且已知均匀核的“效率”只有约93%。

无论如何，Julia中有一个非常好的核密度包，可以在这里找到这里，所以你可能应该直接使用Pkg.add("KernelDensity")，而不是尝试自己编写Epanechnikov核 🙂

学技术

核密度估计 Julia

发表回复取消回复

相关文章：

Related Posts

使用LSTM在Python中预测未来值

如何在gensim的word2vec模型中查找双词组的相似性

dask_xgboost.predict 可以工作但无法显示 – 数据必须是一维的

ML Tuning – Cross Validation in Spark

如何在React JS中使用fetch从REST API获取预测

如何分析ML.NET中多类分类预测得分数组？

发表回复 取消回复

发表回复取消回复