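I am writing a Java program that should compute entropy, joint entropy, conditional entropy and so on for a given data set. The class in question is shown below: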

    import java.util.LinkedList;
    import java.util.List;

    public class Entropy {

        // Value counts for the "interested" and the "reducing" variable.
        private Frequency<String> iFrequency = new Frequency<String>();
        private Frequency<String> rFrequency = new Frequency<String>();

        Entropy() {
            super();
        }

        public void setInterestedFrequency(List<String> interestedFrequency) {
            for (String s : interestedFrequency) {
                this.iFrequency.addValue(s);
            }
        }

        public void setReducingFrequency(List<String> reducingFrequency) {
            for (String s : reducingFrequency) {
                this.rFrequency.addValue(s);
            }
        }

        private double log(double num, int base) {
            return Math.log(num) / Math.log(base);
        }

        // Base-2 entropy of the value frequencies in a single list.
        public double entropy(List<String> data) {
            double entropy = 0.0;
            double prob = 0.0;
            Frequency<String> frequency = new Frequency<String>();
            for (String s : data) {
                frequency.addValue(s);
            }
            String[] keys = frequency.getKeys();
            for (int i = 0; i < keys.length; i++) {
                prob = frequency.getPct(keys[i]);
                entropy = entropy - prob * log(prob, 2);
            }
            return entropy;
        }

        /**
         * Returns the conditional probability P(interestedClass | reducingClass).
         */
        public double conditionalProbability(List<String> interestedSet,
                                             List<String> reducingSet,
                                             String interestedClass,
                                             String reducingClass) {
            List<Integer> conditionalData = new LinkedList<Integer>();
            if (iFrequency.getKeys().length == 0) {
                this.setInterestedFrequency(interestedSet);
            }
            if (rFrequency.getKeys().length == 0) {
                this.setReducingFrequency(reducingSet);
            }
            // Collect the row indices where both classes occur together.
            for (int i = 0; i < reducingSet.size(); i++) {
                if (reducingSet.get(i).equalsIgnoreCase(reducingClass)) {
                    if (interestedSet.get(i).equalsIgnoreCase(interestedClass)) {
                        conditionalData.add(i);
                    }
                }
            }
            int numerator = conditionalData.size();
            int denominator = this.rFrequency.getNum(reducingClass);
            return (double) numerator / denominator;
        }

        public double jointEntropy(List<String> set1, List<String> set2) {
            String[] set1Keys;
            String[] set2Keys;
            Double prob1;
            Double prob2;
            Double entropy = 0.0;
            if (this.iFrequency.getKeys().length == 0) {
                this.setInterestedFrequency(set1);
            }
            if (this.rFrequency.getKeys().length == 0) {
                this.setReducingFrequency(set2);
            }
            set1Keys = this.iFrequency.getKeys();
            set2Keys = this.rFrequency.getKeys();
            // For every key pair, uses the product of the two marginal frequencies.
            for (int i = 0; i < set1Keys.length; i++) {
                for (int j = 0; j < set2Keys.length; j++) {
                    prob1 = iFrequency.getPct(set1Keys[i]);
                    prob2 = rFrequency.getPct(set2Keys[j]);
                    entropy = entropy - (prob1 * prob2) * log((prob1 * prob2), 2);
                }
            }
            return entropy;
        }

        // Joint entropy minus the entropy of the reducing set.
        public double conditionalEntropy(List<String> interestedSet, List<String> reducingSet) {
            double jointEntropy = jointEntropy(interestedSet, reducingSet);
            double reducingEntropyX = entropy(reducingSet);
            double conEntYgivenX = jointEntropy - reducingEntropyX;
            return conEntYgivenX;
        }
    }

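For the past few days I have been trying to work out why my entropy calculation almost always gives exactly the same value as my conditional entropy calculation.
I am using the following formulas: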
H(X) = – Sigma from x=1 to x=n p(x)*log(p(x))
H(XY) = – Sigma from x=1 to x=n,y=1 to y=m (p(x)*p(y)) * log(p(x)*p(y))
H(X|Y) = H(XY) – H(X)
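As a side derivation (not part of the original code), the second formula as written can be expanded. Because it uses the product p(x)*p(y) for every pair of values, and each set of probabilities sums to 1, the double sum splits into the two marginal entropies:

```latex
\begin{aligned}
-\sum_{x=1}^{n}\sum_{y=1}^{m} p(x)\,p(y)\,\log\bigl(p(x)\,p(y)\bigr)
  &= -\sum_{x=1}^{n}\sum_{y=1}^{m} p(x)\,p(y)\,\log p(x)
     \;-\; \sum_{x=1}^{n}\sum_{y=1}^{m} p(x)\,p(y)\,\log p(y) \\
  &= -\sum_{x=1}^{n} p(x)\,\log p(x) \;-\; \sum_{y=1}^{m} p(y)\,\log p(y) \\
  &= H(X) + H(Y)
\end{aligned}
```

Subtracting one marginal entropy from this quantity leaves the other marginal entropy, whatever the data look like. For comparison, the usual textbook definitions use the joint probability of each observed pair: H(X,Y) = -Σ p(x,y) log p(x,y), with H(Y|X) = H(X,Y) - H(X).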
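The values I get for entropy and conditional entropy are almost identical.
With the data set I use for testing, I get the following values: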

    @Test
    public void testEntropy() {
        FileHelper fileHelper = new FileHelper();
        List<String> lines = fileHelper.readFileToMemory("");
        Data freshData = fileHelper.parseCSVData(lines);
        LinkedList<String> headersToChange = new LinkedList<String>();
        headersToChange.add("lwt");
        Data discreteData = freshData.discretize(freshData.getData(), headersToChange, 1, 10);
        Entropy entropy = new Entropy();
        Double result = entropy.entropy(discreteData.getData().get("lwt"));
        assertEquals(2.48, result, .006);
    }

    @Test
    public void testConditionalProbability() {
        FileHelper fileHelper = new FileHelper();
        List<String> lines = fileHelper.readFileToMemory("");
        Data freshData = fileHelper.parseCSVData(lines);
        LinkedList<String> headersToChange = new LinkedList<String>();
        headersToChange.add("age");
        headersToChange.add("lwt");
        Data discreteData = freshData.discretize(freshData.getData(), headersToChange, 1, 10);
        Entropy entropy = new Entropy();
        double conditionalProb = entropy.conditionalProbability(
                discreteData.getData().get("lwt"), discreteData.getData().get("age"), "4", "6");
        assertEquals(.1, conditionalProb, .005);
    }

    @Test
    public void testJointEntropy() {
        FileHelper fileHelper = new FileHelper();
        List<String> lines = fileHelper.readFileToMemory("");
        Data freshData = fileHelper.parseCSVData(lines);
        LinkedList<String> headersToChange = new LinkedList<String>();
        headersToChange.add("age");
        headersToChange.add("lwt");
        Data discreteData = freshData.discretize(freshData.getData(), headersToChange, 1, 10);
        Entropy entropy = new Entropy();
        double jointEntropy = entropy.jointEntropy(
                discreteData.getData().get("lwt"), discreteData.getData().get("age"));
        assertEquals(5.05, jointEntropy, .006);
    }

    @Test
    public void testSpecifiedConditionalEntropy() {
        FileHelper fileHelper = new FileHelper();
        List<String> lines = fileHelper.readFileToMemory("");
        Data freshData = fileHelper.parseCSVData(lines);
        LinkedList<String> headersToChange = new LinkedList<String>();
        headersToChange.add("age");
        headersToChange.add("lwt");
        Data discreteData = freshData.discretize(freshData.getData(), headersToChange, 1, 10);
        Entropy entropy = new Entropy();
        double specCondiEntropy = entropy.specifiedConditionalEntropy(
                discreteData.getData().get("lwt"), discreteData.getData().get("age"), "4", "6");
        assertEquals(.332, specCondiEntropy, .005);
    }

    @Test
    public void testConditionalEntropy() {
        FileHelper fileHelper = new FileHelper();
        List<String> lines = fileHelper.readFileToMemory("");
        Data freshData = fileHelper.parseCSVData(lines);
        LinkedList<String> headersToChange = new LinkedList<String>();
        headersToChange.add("age");
        headersToChange.add("lwt");
        Data discreteData = freshData.discretize(freshData.getData(), headersToChange, 1, 10);
        Entropy entropy = new Entropy();
        Double result = entropy.conditionalEntropy(
                discreteData.getData().get("lwt"), discreteData.getData().get("age"));
        assertEquals(2.47, result, .006);
    }

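Everything compiles without errors, but I am almost certain that my conditional entropy calculation is wrong, and I cannot see what mistake I made.
The values in the unit tests are the ones I currently get; they are the same as the output of the functions above.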
At one point I also tested with the following:

    List<String> survived = Arrays.asList("1","0","1","1","0","1","0","0","0","1","0","1","0","0","1");
    List<String> sex = Arrays.asList("0","1","0","1","1","0","0","1","1","0","1","0","0","1","1");

where male = 1 and survived = 1. I then used it to compute

    double result = entropy.entropy(survived);
    assertEquals(.996, result, .006);

and

    double jointEntropy = entropy.jointEntropy(survived, sex);
    assertEquals(1.99, jointEntropy, .006);

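For comparison, a joint entropy can also be estimated from the paired observations themselves, by counting how often each (survived, sex) combination occurs row by row. The sketch below is a minimal illustration under that assumption. `EmpiricalEntropy` and its methods are hypothetical names, not part of the `Entropy` class above, and it uses a plain `HashMap` in place of the `Frequency` helper.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of the Entropy class above.
final class EmpiricalEntropy {

    // Base-2 Shannon entropy of the empirical distribution of the labels.
    static double entropy(List<String> labels) {
        Map<String, Integer> counts = new HashMap<>();
        for (String label : labels) {
            counts.merge(label, 1, Integer::sum);
        }
        double h = 0.0;
        for (int count : counts.values()) {
            double p = (double) count / labels.size();
            h -= p * Math.log(p) / Math.log(2);
        }
        return h;
    }

    // Joint entropy H(X,Y): the i-th elements of both lists form one outcome.
    // Assumes the labels never contain a tab character, which is used as separator.
    static double jointEntropy(List<String> xs, List<String> ys) {
        if (xs.size() != ys.size()) {
            throw new IllegalArgumentException("lists must have the same length");
        }
        List<String> pairs = new ArrayList<>();
        for (int i = 0; i < xs.size(); i++) {
            pairs.add(xs.get(i) + "\t" + ys.get(i));
        }
        return entropy(pairs);
    }

    // Conditional entropy H(Y|X) = H(X,Y) - H(X).
    static double conditionalEntropy(List<String> ys, List<String> xs) {
        return jointEntropy(xs, ys) - entropy(xs);
    }

    public static void main(String[] args) {
        List<String> survived = Arrays.asList(
                "1", "0", "1", "1", "0", "1", "0", "0", "0", "1", "0", "1", "0", "0", "1");
        List<String> sex = Arrays.asList(
                "0", "1", "0", "1", "1", "0", "0", "1", "1", "0", "1", "0", "0", "1", "1");
        System.out.println("H(survived)       = " + entropy(survived));
        System.out.println("H(survived, sex)  = " + jointEntropy(survived, sex));
        System.out.println("H(survived | sex) = " + conditionalEntropy(survived, sex));
    }
}
```

On the two lists above, counting pairs this way gives a joint entropy of roughly 1.83 rather than 1.99, since the four (survived, sex) combinations do not occur with the frequencies that the product of the marginals would predict.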
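I also checked my work by calculating by hand. You can see a picture of that here: link. Since my code gave the same values as my hand calculation, and since the other functions are very simple and only use the entropy/joint entropy functions, I assumed everything was fine.
However, something is clearly wrong. Below are two additional functions I wrote to calculate the information gain and the symmetrical uncertainty of a set.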

    public double informationGain(List<String> interestedSet, List<String> reducingSet) {
        double entropy = entropy(interestedSet);
        double conditionalEntropy = conditionalEntropy(interestedSet, reducingSet);
        double infoGain = entropy - conditionalEntropy;
        return infoGain;
    }

    public double symmetricalUncertainty(List<String> interestedSet, List<String> reducingSet) {
        double infoGain = informationGain(interestedSet, reducingSet);
        double intSet = entropy(interestedSet);
        double redSet = entropy(reducingSet);
        double symUnc = 2 * (infoGain / (intSet + redSet));
        return symUnc;
    }

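The survived/sex data set I used originally gave me a slightly negative answer. But since it was only negative by 0.000000000000002, I assumed it was just a rounding error. When I then tried running my program, the symmetrical uncertainty values I got made no sense.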
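As a reference point for what sensible values look like, the sketch below computes information gain as H(X) + H(Y) - H(X,Y) and symmetrical uncertainty as 2 * gain / (H(X) + H(Y)), with the joint entropy taken from row-wise pairs. Under those definitions the gain is non-negative in exact arithmetic and the symmetrical uncertainty lies between 0 and 1. The class name, the method names, the tab separator and the zero-denominator convention are all assumptions made for this illustration, not part of the code above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of the Entropy class above.
final class SymmetricalUncertaintySketch {

    // Base-2 Shannon entropy of the empirical distribution of the labels.
    static double entropy(List<String> labels) {
        Map<String, Integer> counts = new HashMap<>();
        for (String label : labels) {
            counts.merge(label, 1, Integer::sum);
        }
        double h = 0.0;
        for (int count : counts.values()) {
            double p = (double) count / labels.size();
            h -= p * Math.log(p) / Math.log(2);
        }
        return h;
    }

    // Pairs the i-th elements of both lists so that each row is one joint outcome.
    // Assumes equal list lengths and labels that contain no tab character.
    static List<String> pairRows(List<String> xs, List<String> ys) {
        if (xs.size() != ys.size()) {
            throw new IllegalArgumentException("lists must have the same length");
        }
        List<String> pairs = new ArrayList<>();
        for (int i = 0; i < xs.size(); i++) {
            pairs.add(xs.get(i) + "\t" + ys.get(i));
        }
        return pairs;
    }

    // Information gain (mutual information): H(X) + H(Y) - H(X,Y).
    static double informationGain(List<String> xs, List<String> ys) {
        return entropy(xs) + entropy(ys) - entropy(pairRows(xs, ys));
    }

    // Symmetrical uncertainty: 2 * gain / (H(X) + H(Y)).
    // Returns 0 when both entropies are 0 (a convention assumed here).
    static double symmetricalUncertainty(List<String> xs, List<String> ys) {
        double denominator = entropy(xs) + entropy(ys);
        if (denominator == 0.0) {
            return 0.0;
        }
        return 2.0 * informationGain(xs, ys) / denominator;
    }
}
```

A quick sanity check for any implementation of these two functions is to pass the same list twice: the information gain should then equal the entropy of that list, and the symmetrical uncertainty should be 1.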
Answer: