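Suppose I have the following dataset (Google Drive link).
The leftmost column is the card type/class (clubs, spades, diamonds, hearts). The remaining columns are the features (Hu moments).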
    4.00000000e+000 1.81685834e-001 9.69817396e-006 1.38999809e-003 4.53935830e-006 -3.00925971e-010 -1.02459512e-008 -1.98644904e-010
    4.00000000e+000 1.84243083e-001 1.38222459e-005 1.40735374e-003 5.54632513e-006 -4.43889193e-010 -1.56489028e-008 -2.07550859e-010
    4.00000000e+000 1.82590649e-001 5.79561254e-005 1.39540810e-003 5.08169772e-006 -1.84162373e-010 -6.28655267e-009 -3.86265224e-010
    ... ... ... ... ... ... ... ... ...
    3.00000000e+000 1.82603791e-001 9.40113955e-005 2.03910312e-003 8.28822917e-006 -1.07466686e-009 -7.79983939e-008 7.79123931e-011
    3.00000000e+000 1.83689464e-001 1.04402426e-005 2.03314034e-003 8.07647097e-006 -1.01550111e-009 -1.72512940e-008 1.99657322e-010
    3.00000000e+000 1.80537920e-001 3.57786643e-005 1.76044988e-003 4.93065630e-006 -4.50792164e-010 -2.52193544e-008 8.83931179e-011
    ... ... ... ... ... ... ... ... ...
    2.00000000e+000 1.69366341e-001 1.04327615e-003 1.34561560e-006 8.41412130e-008 9.60997904e-015 2.07709872e-009 -2.66313560e-014
    2.00000000e+000 1.70623294e-001 1.52567078e-003 2.33145414e-005 1.91976774e-006 1.28281112e-011 7.49218536e-008 -6.30393351e-013
    2.00000000e+000 1.71039727e-001 1.75199006e-003 3.56406516e-007 2.25222892e-008 -1.80796663e-016 8.75703034e-010 -2.00974686e-015
    ... ... ... ... ... ... ... ... ...
    1.00000000e+000 2.03297227e-001 4.88342633e-004 2.30244914e-003 2.76274577e-006 -1.62641080e-010 -5.06416340e-008 -1.48662421e-010
    1.00000000e+000 2.02575326e-001 3.16058139e-004 2.03933434e-003 4.34776729e-007 -1.26636446e-011 -7.63543121e-009 2.69021091e-012
    1.00000000e+000 2.02239287e-001 3.21962233e-004 1.94963577e-003 1.92362659e-006 -2.34173299e-011 -1.78153951e-008 1.15452477e-010
    1.00000000e+000 2.02709157e-001 2.28613647e-004 1.89761073e-003 1.09923103e-006 1.25239064e-011 -3.87194855e-009 4.86166479e-011
    1.00000000e+000 1.99640647e-001 1.80163318e-004 1.66091127e-003 3.40914582e-007 6.26687530e-012 7.47151809e-010 5.15120878e-012
    ... ... ... ... ... ... ... ... ...

    4.00000000e+000 1.94974773e-001 1.02770938e-003 3.32021924e-005 7.56951250e-005 -3.21487967e-009 2.42373008e-006 -2.01613839e-009
    4.00000000e+000 1.91031757e-001 1.04421581e-003 1.30233680e-005 5.48067243e-005 -1.41634644e-009 1.76666840e-006 3.71433852e-010
    4.00000000e+000 1.94861863e-001 9.86215578e-004 4.27892747e-005 7.04495953e-005 -3.50245985e-009 2.21146739e-006 -1.64137532e-009
    ... ... ... ... ... ... ... ... ...
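- Which outlier detection and removal method is best suited to this kind of data?
- How do I detect and remove the outliers from this dataset?
Edit:
My teacher wrote the following: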
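    load train.txt
    load test.txt

    % Compare the mean and the median
    [mean(train); median(train)]

    % Here you can compare different parameters computed on the training
    % and test sets - they should be roughly the same

    % Plot histograms - first display the labels (1..4) to check the histogram plotting
    % hist can be used to plot a single dimension
    hist(train(:,1))

    % Now plot the histogram of the first feature
    hist(train(:,2))

    % Plot the first two features in 2D
    % It is a good idea to repeat the plot after every modification of the training set
    plot2features(train, 2, 3);

    % Find the rows that contain the outliers
    [mv mi] = max(train)

    % Remove the outlier from the training set
    train(186,:)=[];

    % Find the rows that contain the outliers
    [mv mi] = min(train)

    % Remove the outlier from the training set
    train(641,:)=[];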
I can't understand what he did or why he did it.
Answer:
This is the MATLAB code I use for outlier detection and removal:
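    function mat = removeOutlier(train)
    % Column 1 holds the class label, so only columns 2:end are searched.
    % mv holds the extreme value of each feature, mi the row it occurs in.
    [mv, mi] = max(train(:,2:end));
    % Delete the row that holds the maximum for the most features
    x = mode(mi);
    train(x, :) = [];
    % Repeat for the minimum, recomputed after the first deletion
    [mv, mi] = min(train(:,2:end));
    x = mode(mi);
    train(x, :) = [];
    mat = train;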
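
The function above deletes exactly one row per extreme and looks at all four classes together. As a rough sketch of a per-class alternative (my own illustration, not the teacher's method), the version below flags every row whose feature lies more than `k` scaled median absolute deviations from the median of its own class. The name `removeOutliersPerClass` and the default `k = 3` are assumptions, and column 1 is assumed to hold the label as in the data shown in the question.

```matlab
function clean = removeOutliersPerClass(train, k)
% Hypothetical helper (not from the original post): drops rows whose
% features lie more than k scaled MADs from the median of their own class.
% Column 1 is assumed to hold the class label, columns 2:end the features.
    if nargin < 2
        k = 3;                                   % assumed default threshold
    end
    keep = true(size(train, 1), 1);
    classes = unique(train(:, 1));
    for c = classes'                             % iterate over class labels
        rows = find(train(:, 1) == c);
        X = train(rows, 2:end);
        med = median(X, 1);                      % per-feature median
        dev = abs(bsxfun(@minus, X, med));       % absolute deviations
        madv = 1.4826 * median(dev, 1);          % scaled MAD per feature
        madv(madv == 0) = Inf;                   % ignore features with zero spread
        bad = any(bsxfun(@rdivide, dev, madv) > k, 2);
        keep(rows(bad)) = false;                 % mark outlying rows for removal
    end
    clean = train(keep, :);
end
```

Calling `train = removeOutliersPerClass(train);` and then repeating the histogram and 2D plots shows what was removed. Each feature is scaled by its own spread, which matters here because the Hu moment columns differ by many orders of magnitude.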