I am working with a patient-information dataset and trying to compute propensity scores from the data in MATLAB. After removing features that had many missing values, I am still left with several missing (NaN) values.
Because of these missing values, I get an error when I try to run logistic regression with the following MATLAB code (from Andrew Ng's Coursera Machine Learning course), since the values of my cost function and gradient vector become NaN:
[m, n] = size(X);
X = [ones(m, 1) X];
initial_theta = ones(n+1, 1);
[cost, grad] = costFunction(initial_theta, X, y);
options = optimset('GradObj', 'on', 'MaxIter', 400);
[theta, cost] = ...
    fminunc(@(t)(costFunction(t, X, y)), initial_theta, options);
Note: sigmoid and costFunction are working functions I created for overall ease of use.
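For reference, the costFunction from that course is usually implemented roughly as below (a sketch, not necessarily your exact version); it shows why any NaN in X propagates straight into the cost and gradient:

function g = sigmoid(z)
% Logistic (sigmoid) function.
g = 1 ./ (1 + exp(-z));
end

function [J, grad] = costFunction(theta, X, y)
% Unregularized logistic regression cost and gradient; any NaN in X
% propagates into h and therefore into J and grad.
m = length(y);
h = sigmoid(X * theta);
J = (1/m) * (-y' * log(h) - (1 - y)' * log(1 - h));
grad = (1/m) * (X' * (h - y));
end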
If I replace all the NaN values with 1 or 0, the computation runs smoothly. However, I am not sure whether this is the right way to handle the problem, and I am also wondering what replacement value one should choose in general to get the best results from logistic regression with missing data. Are there any advantages/disadvantages to using a particular number (0, 1, or something else) to replace the missing values?
Note: I have also normalized all feature values to the 0-1 range.
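For clarity, the replacement and scaling described above would look roughly like this (a sketch; X is assumed to be the raw feature matrix, and 0 is one of the two fill values mentioned):

X(isnan(X)) = 0;                  % replace missing values with 0 (or 1)
mn = min(X,[],1);
mx = max(X,[],1);
% scale each column to [0,1]; assumes every column has a nonzero range
X = bsxfun(@rdivide, bsxfun(@minus, X, mn), mx - mn);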
Any insight into this problem would be much appreciated. Thank you.
Answer:
As pointed out earlier, this is a problem people face on any programming platform. It is called "missing data imputation".
Forcing all the missing values to some particular number certainly has drawbacks. Depending on how your data are distributed, it can be quite serious; for example, setting all missing values to 1 in binary sparse data that contains more zeros than ones.
Fortunately, MATLAB has a function called knnimpute that estimates a missing data point from its nearest neighbors.
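A minimal call might look like the following (a sketch; X, the choice of k = 3, and the orientation are assumptions, so check the knnimpute documentation for whether it treats rows or columns as the observations):

% knnimpute ships with the Bioinformatics Toolbox; k = 3 neighbors is an arbitrary choice.
% Depending on the orientation knnimpute expects, you may need to pass X' and transpose the result back.
Ximputed = knnimpute(X, 3);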
In my experience, I have often found knnimpute useful. However, it can fall short when there are too many missing points in the data: the neighbors of a missing point may themselves be incomplete, which leads to inaccurate estimates. Below I have worked out a solution; it starts by imputing the least incomplete columns and (optionally) imposes a safe, predefined distance for the neighbors. I hope this helps.
function data = dnnimpute(data,distCutoff,option,distMetric)
% data = dnnimpute(data,distCutoff,option,distMetric)
%
%   Distance-based nearest neighbor imputation that imposes a distance
%     cutoff to determine nearest neighbors, i.e., avoids those samples
%     that are more distant than the distCutoff argument.
%
%   Imputes missing data coded by "NaN" starting from the covariates
%     (columns) with the least number of missing data. Then it continues by
%     including more (complete) covariates in the calculation of pair-wise
%     distances.
%
%   option,
%       'median'      - Median of the nearest neighboring values
%       'weighted'    - Weighted average of the nearest neighboring values
%       'default'     - Unweighted average of the nearest neighboring values
%
%   distMetric,
%       'euclidean'   - Euclidean distance (default)
%       'seuclidean'  - Standardized Euclidean distance. Each coordinate
%                       difference between rows in X is scaled by dividing
%                       by the corresponding element of the standard
%                       deviation S=NANSTD(X). To specify another value for
%                       S, use D=pdist(X,'seuclidean',S).
%       'cityblock'   - City Block distance
%       'minkowski'   - Minkowski distance. The default exponent is 2. To
%                       specify a different exponent, use
%                       D = pdist(X,'minkowski',P), where the exponent P is
%                       a scalar positive value.
%       'chebychev'   - Chebychev distance (maximum coordinate difference)
%       'mahalanobis' - Mahalanobis distance, using the sample covariance
%                       of X as computed by NANCOV. To compute the distance
%                       with a different covariance, use
%                       D = pdist(X,'mahalanobis',C), where the matrix C
%                       is symmetric and positive definite.
%       'cosine'      - One minus the cosine of the included angle
%                       between observations (treated as vectors)
%       'correlation' - One minus the sample linear correlation between
%                       observations (treated as sequences of values).
%       'spearman'    - One minus the sample Spearman's rank correlation
%                       between observations (treated as sequences of values).
%       'hamming'     - Hamming distance, percentage of coordinates
%                       that differ
%       'jaccard'     - One minus the Jaccard coefficient, the
%                       percentage of nonzero coordinates that differ
%       function      - A distance function specified using @, for
%                       example @DISTFUN.
%
if nargin < 3
    option = 'mean';
end
if nargin < 4
    distMetric = 'euclidean';
end

nanVals = isnan(data);
nanValsPerCov = sum(nanVals,1);
noNansCov = nanValsPerCov == 0;
if isempty(find(noNansCov, 1))
    % No complete covariate exists: seed the covariate with the fewest NaNs
    % by filling its missing entries with the row-wise mean of the observed values.
    [~,leastNans] = min(nanValsPerCov);
    noNansCov(leastNans) = true;
    first = data(nanVals(:,noNansCov),:);
    nanRows = find(nanVals(:,noNansCov)==true);
    i = 1;
    for row = first'
        data(nanRows(i),noNansCov) = mean(row(~isnan(row)));
        i = i+1;
    end
end

nSamples = size(data,1);
if nargin < 2
    % Default distance cutoff: the smallest pairwise distance among the
    % complete covariates.
    dataNoNans = data(:,noNansCov);
    distances = pdist(dataNoNans);
    distCutoff = min(distances);
end
[stdCovMissDat,idxCovMissDat] = sort(nanValsPerCov,'ascend');
imputeCols = idxCovMissDat(stdCovMissDat>0);
% Impute starting from the cols (covariates) with the least number of
% missing data.
for c = reshape(imputeCols,1,length(imputeCols))
    imputeRows = 1:nSamples;
    imputeRows = imputeRows(nanVals(:,c));
    for r = reshape(imputeRows,1,length(imputeRows))
        % Calculate distances from sample r to every complete sample of
        % covariate c, using only covariates observed in both samples.
        distR = inf(nSamples,1);
        noNansCov_r = find(isnan(data(r,:))==0);
        noNansCov_r = noNansCov_r(sum(isnan(data(nanVals(:,c)'==false,noNansCov_r)),1)==0);
        for i = find(nanVals(:,c)'==false)
            distR(i) = pdist([data(r,noNansCov_r); data(i,noNansCov_r)],distMetric);
        end
        tmp = min(distR(distR>0));
        % Impute the missing data at sample r of covariate c
        switch option
            case 'weighted'
                data(r,c) = (1./distR(distR<=max(distCutoff,tmp)))' * data(distR<=max(distCutoff,tmp),c) / sum(1./distR(distR<=max(distCutoff,tmp)));
            case 'median'
                data(r,c) = median(data(distR<=max(distCutoff,tmp),c),1);
            case 'mean'
                data(r,c) = mean(data(distR<=max(distCutoff,tmp),c),1);
        end
        % The missing data in sample r is imputed. Update the sample
        % indices of c which are imputed.
        nanVals(r,c) = false;
    end
    fprintf('%u/%u of the covariates are imputed.\n',find(c==imputeCols),length(imputeCols));
end
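Based on the signature above, a call could look like this (the cutoff 0.5 and the 'weighted' option are just example choices):

% data: samples in rows, covariates in columns, missing entries coded as NaN.
dataImputed = dnnimpute(data, 0.5, 'weighted', 'euclidean');
% Or rely on the defaults (cutoff derived from the complete covariates, unweighted mean):
dataImputed = dnnimpute(data);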