I am working with a patient-information dataset and trying to compute propensity scores from the data in MATLAB. After removing features that had many missing values, I am still left with several missing (NaN) values.
Because of these missing values, I get an error when I try to run logistic regression with the following MATLAB code (from Andrew Ng's Coursera Machine Learning course), since the values of my cost function and gradient vector become NaN:
[m, n] = size(X);
X = [ones(m, 1) X];
initial_theta = ones(n+1, 1);
[cost, grad] = costFunction(initial_theta, X, y);
options = optimset('GradObj', 'on', 'MaxIter', 400);
[theta, cost] = ...
    fminunc(@(t)(costFunction(t, X, y)), initial_theta, options);
Note: sigmoid and costFunction are working functions I created for overall ease of use.
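For reference, the costFunction from that course is usually implemented roughly as below (a sketch, not necessarily your exact version); it shows why any NaN in X propagates straight into the cost and gradient:

function g = sigmoid(z)
% Logistic (sigmoid) function.
g = 1 ./ (1 + exp(-z));
end

function [J, grad] = costFunction(theta, X, y)
% Unregularized logistic regression cost and gradient; any NaN in X
% propagates into h and therefore into J and grad.
m = length(y);
h = sigmoid(X * theta);
J = (1/m) * (-y' * log(h) - (1 - y)' * log(1 - h));
grad = (1/m) * (X' * (h - y));
end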
If I replace all the NaN values with 1 or 0, the computation runs smoothly. However, I am not sure whether this is the right way to handle the problem, and I am also wondering what replacement value one should choose in general to get the best results from logistic regression with missing data. Are there any advantages/disadvantages to using a particular number (0, 1, or something else) to replace the missing values?
Note: I have also normalized all feature values to the 0-1 range.
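For clarity, the replacement and scaling described above would look roughly like this (a sketch; X is assumed to be the raw feature matrix, and 0 is one of the two fill values mentioned):

X(isnan(X)) = 0;                  % replace missing values with 0 (or 1)
mn = min(X,[],1);
mx = max(X,[],1);
% scale each column to [0,1]; assumes every column has a nonzero range
X = bsxfun(@rdivide, bsxfun(@minus, X, mn), mx - mn);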
Any insight into this problem would be much appreciated. Thank you.
Answer:
As pointed out earlier, this is a problem people face on any programming platform. It is called "missing data imputation".
Forcing all the missing values to some particular number certainly has drawbacks. Depending on how your data are distributed, it can be quite serious; for example, setting all missing values to 1 in binary sparse data that contains more zeros than ones.
Fortunately, MATLAB has a function called knnimpute that estimates a missing data point from its nearest neighbors.
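A minimal call might look like the following (a sketch; X, the choice of k = 3, and the orientation are assumptions, so check the knnimpute documentation for whether it treats rows or columns as the observations):

% knnimpute ships with the Bioinformatics Toolbox; k = 3 neighbors is an arbitrary choice.
% Depending on the orientation knnimpute expects, you may need to pass X' and transpose the result back.
Ximputed = knnimpute(X, 3);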
In my experience, I have often found knnimpute useful. However, it can fall short when there are too many missing points in the data: the neighbors of a missing point may themselves be incomplete, which leads to inaccurate estimates. Below I have worked out a solution; it starts by imputing the least incomplete columns and (optionally) imposes a safe, predefined distance for the neighbors. I hope this helps.
function data = dnnimpute(data,distCutoff,option,distMetric)
% data = dnnimpute(data,distCutoff,option,distMetric)
%
%   Distance-based nearest neighbor imputation that imposes a distance
%     cutoff to determine nearest neighbors, i.e., avoids those samples
%     that are more distant than the distCutoff argument.
%
%   Imputes missing data coded by "NaN" starting from the covariates
%     (columns) with the least number of missing data. Then it continues by
%     including more (complete) covariates in the calculation of pair-wise
%     distances.
%
%   option,
%       'median'      - Median of the nearest neighboring values
%       'weighted'    - Weighted average of the nearest neighboring values
%       'default'     - Unweighted average of the nearest neighboring values
%
%   distMetric,
%       'euclidean'   - Euclidean distance (default)
%       'seuclidean'  - Standardized Euclidean distance. Each coordinate
%                       difference between rows in X is scaled by dividing
%                       by the corresponding element of the standard
%                       deviation S=NANSTD(X). To specify another value for
%                       S, use D=pdist(X,'seuclidean',S).
%       'cityblock'   - City Block distance
%       'minkowski'   - Minkowski distance. The default exponent is 2. To
%                       specify a different exponent, use
%                       D = pdist(X,'minkowski',P), where the exponent P is
%                       a scalar positive value.
%       'chebychev'   - Chebychev distance (maximum coordinate difference)
%       'mahalanobis' - Mahalanobis distance, using the sample covariance
%                       of X as computed by NANCOV. To compute the distance
%                       with a different covariance, use
%                       D = pdist(X,'mahalanobis',C), where the matrix C
%                       is symmetric and positive definite.
%       'cosine'      - One minus the cosine of the included angle
%                       between observations (treated as vectors)
%       'correlation' - One minus the sample linear correlation between
%                       observations (treated as sequences of values).
%       'spearman'    - One minus the sample Spearman's rank correlation
%                       between observations (treated as sequences of values).
%       'hamming'     - Hamming distance, percentage of coordinates
%                       that differ
%       'jaccard'     - One minus the Jaccard coefficient, the
%                       percentage of nonzero coordinates that differ
%       function      - A distance function specified using @, for
%                       example @DISTFUN.
%
if nargin < 3
    option = 'mean';
end
if nargin < 4
    distMetric = 'euclidean';
end

nanVals = isnan(data);
nanValsPerCov = sum(nanVals,1);
noNansCov = nanValsPerCov == 0;
if isempty(find(noNansCov, 1))
    % No complete covariate exists: seed the covariate with the fewest NaNs
    % by filling its missing entries with the row-wise mean of the observed values.
    [~,leastNans] = min(nanValsPerCov);
    noNansCov(leastNans) = true;
    first = data(nanVals(:,noNansCov),:);
    nanRows = find(nanVals(:,noNansCov)==true);
    i = 1;
    for row = first'
        data(nanRows(i),noNansCov) = mean(row(~isnan(row)));
        i = i+1;
    end
end

nSamples = size(data,1);
if nargin < 2
    % Default distance cutoff: the smallest pairwise distance among the
    % complete covariates.
    dataNoNans = data(:,noNansCov);
    distances = pdist(dataNoNans);
    distCutoff = min(distances);
end
[stdCovMissDat,idxCovMissDat] = sort(nanValsPerCov,'ascend');
imputeCols = idxCovMissDat(stdCovMissDat>0);
% Impute starting from the cols (covariates) with the least number of
% missing data.
for c = reshape(imputeCols,1,length(imputeCols))
    imputeRows = 1:nSamples;
    imputeRows = imputeRows(nanVals(:,c));
    for r = reshape(imputeRows,1,length(imputeRows))
        % Calculate distances from sample r to every complete sample of
        % covariate c, using only covariates observed in both samples.
        distR = inf(nSamples,1);
        noNansCov_r = find(isnan(data(r,:))==0);
        noNansCov_r = noNansCov_r(sum(isnan(data(nanVals(:,c)'==false,noNansCov_r)),1)==0);
        for i = find(nanVals(:,c)'==false)
            distR(i) = pdist([data(r,noNansCov_r); data(i,noNansCov_r)],distMetric);
        end
        tmp = min(distR(distR>0));
        % Impute the missing data at sample r of covariate c
        switch option
            case 'weighted'
                data(r,c) = (1./distR(distR<=max(distCutoff,tmp)))' * data(distR<=max(distCutoff,tmp),c) / sum(1./distR(distR<=max(distCutoff,tmp)));
            case 'median'
                data(r,c) = median(data(distR<=max(distCutoff,tmp),c),1);
            case 'mean'
                data(r,c) = mean(data(distR<=max(distCutoff,tmp),c),1);
        end
        % The missing data in sample r is imputed. Update the sample
        % indices of c which are imputed.
        nanVals(r,c) = false;
    end
    fprintf('%u/%u of the covariates are imputed.\n',find(c==imputeCols),length(imputeCols));
end
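Based on the signature above, a call could look like this (the cutoff 0.5 and the 'weighted' option are just example choices):

% data: samples in rows, covariates in columns, missing entries coded as NaN.
dataImputed = dnnimpute(data, 0.5, 'weighted', 'euclidean');
% Or rely on the defaults (cutoff derived from the complete covariates, unweighted mean):
dataImputed = dnnimpute(data);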