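Does my code correctly compute the entropy/conditional entropy of a dataset?

I am writing a Java program that should compute entropy, joint entropy, conditional entropy, and so on for a given dataset. The relevant class is shown below: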

    import java.util.LinkedList;
    import java.util.List;

    public class Entropy {

        private Frequency<String> iFrequency = new Frequency<String>();
        private Frequency<String> rFrequency = new Frequency<String>();

        Entropy(){
            super();
        }

        public void setInterestedFrequency(List<String> interestedFrequency){
            for(String s: interestedFrequency){
                this.iFrequency.addValue(s);
            }
        }

        public void setReducingFrequency(List<String> reducingFrequency){
            for(String s:reducingFrequency){
                this.rFrequency.addValue(s);
            }
        }

        private double log(double num, int base){
            return Math.log(num)/Math.log(base);
        }

        public double entropy(List<String> data){
            double entropy = 0.0;
            double prob = 0.0;
            Frequency<String> frequency = new Frequency<String>();
            for(String s:data){
                frequency.addValue(s);
            }
            String[] keys = frequency.getKeys();
            for(int i=0;i<keys.length;i++){
                prob = frequency.getPct(keys[i]);
                entropy = entropy - prob * log(prob,2);
            }
            return entropy;
        }

        /**
         * return conditional probability of P(interestedClass|reducingClass)
         */
        public double conditionalProbability(List<String> interestedSet,
                                             List<String> reducingSet,
                                             String interestedClass,
                                             String reducingClass){
            List<Integer> conditionalData = new LinkedList<Integer>();
            if(iFrequency.getKeys().length==0){
                this.setInterestedFrequency(interestedSet);
            }
            if(rFrequency.getKeys().length==0){
                this.setReducingFrequency(reducingSet);
            }
            for(int i = 0;i<reducingSet.size();i++){
                if(reducingSet.get(i).equalsIgnoreCase(reducingClass)){
                    if(interestedSet.get(i).equalsIgnoreCase(interestedClass)){
                        conditionalData.add(i);
                    }
                }
            }
            int numerator = conditionalData.size();
            int denominator = this.rFrequency.getNum(reducingClass);
            return (double)numerator/denominator;
        }

        public double jointEntropy(List<String> set1, List<String> set2){
            String[] set1Keys;
            String[] set2Keys;
            Double prob1;
            Double prob2;
            Double entropy = 0.0;
            if(this.iFrequency.getKeys().length==0){
                this.setInterestedFrequency(set1);
            }
            if(this.rFrequency.getKeys().length==0){
                this.setReducingFrequency(set2);
            }
            set1Keys = this.iFrequency.getKeys();
            set2Keys = this.rFrequency.getKeys();
            for(int i=0;i<set1Keys.length;i++){
                for(int j=0;j<set2Keys.length;j++){
                    prob1 = iFrequency.getPct(set1Keys[i]);
                    prob2 = rFrequency.getPct(set2Keys[j]);
                    entropy = entropy - (prob1*prob2)*log((prob1*prob2),2);
                }
            }
            return entropy;
        }

        public double conditionalEntropy(List<String> interestedSet, List<String> reducingSet){
            double jointEntropy = jointEntropy(interestedSet,reducingSet);
            double reducingEntropyX = entropy(reducingSet);
            double conEntYgivenX = jointEntropy - reducingEntropyX;
            return conEntYgivenX;
        }
    }

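For the past few days I have been trying to work out why my entropy calculation almost always comes out the same as my conditional entropy calculation.

I used the following formulas: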

H(X) = -\sum_{x=1}^{n} p(x) \log p(x)

H(XY) = -\sum_{x=1}^{n} \sum_{y=1}^{m} p(x) p(y) \log(p(x) p(y))

H(X|Y) = H(XY) - H(X)

The values I get for entropy and conditional entropy are nearly identical.
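One possible reading of this symptom, sketched under the assumption that p(x) and p(y) in the formulas above are the marginal frequencies of each column: expanding the joint-entropy formula as written shows that it always equals the sum of the two marginal entropies, so subtracting one marginal leaves exactly the other. The last two lines give the usual definitions for comparison, where p(x,y) is the fraction of rows in which both values occur together.

```latex
\begin{align*}
-\sum_{x=1}^{n}\sum_{y=1}^{m} p(x)\,p(y)\,\log\bigl(p(x)\,p(y)\bigr)
  &= -\sum_{x}\sum_{y} p(x)\,p(y)\,\bigl[\log p(x) + \log p(y)\bigr] \\
  &= -\Bigl(\sum_{y} p(y)\Bigr)\sum_{x} p(x)\log p(x)
     \;-\; \Bigl(\sum_{x} p(x)\Bigr)\sum_{y} p(y)\log p(y) \\
  &= H(X) + H(Y)
  \qquad \text{since } \textstyle\sum_{x} p(x) = \sum_{y} p(y) = 1 \\[6pt]
\bigl(H(X) + H(Y)\bigr) - H(Y) &= H(X) \\[6pt]
H(X,Y) &= -\sum_{x=1}^{n}\sum_{y=1}^{m} p(x,y)\,\log p(x,y) \\
H(X \mid Y) &= H(X,Y) - H(Y)
\end{align*}
```

The product p(x)p(y) equals p(x,y) only when the two variables are independent, which is the one case where conditioning cannot reduce the entropy.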

With the dataset I use for testing, I get the following values:

    @Test
    public void testEntropy(){
        FileHelper fileHelper = new FileHelper();
        List<String> lines = fileHelper.readFileToMemory("");
        Data freshData = fileHelper.parseCSVData(lines);
        LinkedList<String> headersToChange = new LinkedList<String>();
        headersToChange.add("lwt");
        Data discreteData = freshData.discretize(freshData.getData(),headersToChange,1,10);
        Entropy entropy = new Entropy();
        Double result = entropy.entropy(discreteData.getData().get("lwt"));
        assertEquals(2.48,result,.006);
    }

    @Test
    public void testConditionalProbability(){
        FileHelper fileHelper = new FileHelper();
        List<String> lines = fileHelper.readFileToMemory("");
        Data freshData = fileHelper.parseCSVData(lines);
        LinkedList<String> headersToChange = new LinkedList<String>();
        headersToChange.add("age");
        headersToChange.add("lwt");
        Data discreteData = freshData.discretize(freshData.getData(), headersToChange, 1, 10);
        Entropy entropy = new Entropy();
        double conditionalProb = entropy.conditionalProbability(discreteData.getData().get("lwt"),discreteData.getData().get("age"),"4","6");
        assertEquals(.1,conditionalProb,.005);
    }

    @Test
    public void testJointEntropy(){
        FileHelper fileHelper = new FileHelper();
        List<String> lines = fileHelper.readFileToMemory("");
        Data freshData = fileHelper.parseCSVData(lines);
        LinkedList<String> headersToChange = new LinkedList<String>();
        headersToChange.add("age");
        headersToChange.add("lwt");
        Data discreteData = freshData.discretize(freshData.getData(), headersToChange, 1, 10);
        Entropy entropy = new Entropy();
        double jointEntropy = entropy.jointEntropy(discreteData.getData().get("lwt"),discreteData.getData().get("age"));
        assertEquals(5.05,jointEntropy,.006);
    }

    @Test
    public void testSpecifiedConditionalEntropy(){
        FileHelper fileHelper = new FileHelper();
        List<String> lines = fileHelper.readFileToMemory("");
        Data freshData = fileHelper.parseCSVData(lines);
        LinkedList<String> headersToChange = new LinkedList<String>();
        headersToChange.add("age");
        headersToChange.add("lwt");
        Data discreteData = freshData.discretize(freshData.getData(), headersToChange, 1, 10);
        Entropy entropy = new Entropy();
        double specCondiEntropy = entropy.specifiedConditionalEntropy(discreteData.getData().get("lwt"),discreteData.getData().get("age"),"4","6");
        assertEquals(.332,specCondiEntropy,.005);
    }

    @Test
    public void testConditionalEntropy(){
        FileHelper fileHelper = new FileHelper();
        List<String> lines = fileHelper.readFileToMemory("");
        Data freshData = fileHelper.parseCSVData(lines);
        LinkedList<String> headersToChange = new LinkedList<String>();
        headersToChange.add("age");
        headersToChange.add("lwt");
        Data discreteData = freshData.discretize(freshData.getData(), headersToChange, 1, 10);
        Entropy entropy = new Entropy();
        Double result = entropy.conditionalEntropy(discreteData.getData().get("lwt"),discreteData.getData().get("age"));
        assertEquals(2.47,result,.006);
    }

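Everything compiles, but I am almost certain that my conditional entropy calculation is wrong, and I cannot see what mistake I made.

The values in the unit tests are the ones I currently get. They match the output of the functions above.

At one point I also tested with the following: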

    List<String> survived = Arrays.asList("1","0","1","1","0","1","0","0","0","1","0","1","0","0","1");
    List<String> sex = Arrays.asList("0","1","0","1","1","0","0","1","1","0","1","0","0","1","1");

where male = 1 and survived = 1. I then used it to compute

    double result = entropy.entropy(survived);
    assertEquals(.996,result,.006);

and

    double jointEntropy = entropy.jointEntropy(survived,sex);
    assertEquals(1.99,jointEntropy,.006);

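I also checked my work by calculating by hand. You can see an image here: link. Since my code gives the same values as my hand calculation, and since the other functions are very simple and only use the entropy/joint entropy functions, I assumed everything was fine.

However, something is definitely wrong. Below are two additional functions I wrote to compute the information gain and the symmetrical uncertainty of a set.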

    public double informationGain(List<String> interestedSet, List<String> reducingSet){
        double entropy = entropy(interestedSet);
        double conditionalEntropy = conditionalEntropy(interestedSet,reducingSet);
        double infoGain = entropy - conditionalEntropy;
        return infoGain;
    }

    public double symmetricalUncertainty(List<String> interestedSet, List<String> reducingSet){
        double infoGain = informationGain(interestedSet,reducingSet);
        double intSet = entropy(interestedSet);
        double redSet = entropy(reducingSet);
        double symUnc = 2 * ( infoGain/ (intSet+redSet) );
        return symUnc;
    }

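The survival/sex dataset I used originally gave me a slightly negative answer. Since it was negative by only 0.000000000000002, I assumed it was just a rounding error. But when I ran my program, the symmetrical uncertainty values I got made no sense.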


Answer:
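A minimal sketch of one likely cause, offered as a suggestion and not a confirmed diagnosis: jointEntropy in the question multiplies the two marginal frequencies, which treats the columns as independent. The sketch below instead counts how often each pair of values occurs in the same row. The class name JointEntropySketch and the helper entropyOfCounts are hypothetical, and plain HashMap counts stand in for the Frequency class, whose source is not shown.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; not part of the original program.
class JointEntropySketch {

    // Shannon entropy in bits of a table of counts that sum to total.
    static <K> double entropyOfCounts(Map<K, Integer> counts, int total) {
        double h = 0.0;
        for (int c : counts.values()) {
            double p = (double) c / total;
            h -= p * (Math.log(p) / Math.log(2));
        }
        return h;
    }

    // H(X): entropy of a single column.
    static double entropy(List<String> xs) {
        Map<String, Integer> counts = new HashMap<>();
        for (String x : xs) {
            counts.merge(x, 1, Integer::sum);
        }
        return entropyOfCounts(counts, xs.size());
    }

    // H(X,Y): entropy of the observed (x, y) pairs, matched row by row.
    static double jointEntropy(List<String> xs, List<String> ys) {
        if (xs.size() != ys.size()) {
            throw new IllegalArgumentException("columns must have the same length");
        }
        Map<List<String>, Integer> counts = new HashMap<>();
        for (int i = 0; i < xs.size(); i++) {
            counts.merge(Arrays.asList(xs.get(i), ys.get(i)), 1, Integer::sum);
        }
        return entropyOfCounts(counts, xs.size());
    }

    // H(interested | reducing) = H(interested, reducing) - H(reducing).
    static double conditionalEntropy(List<String> interested, List<String> reducing) {
        return jointEntropy(interested, reducing) - entropy(reducing);
    }

    // Information gain: H(interested) - H(interested | reducing).
    static double informationGain(List<String> interested, List<String> reducing) {
        return entropy(interested) - conditionalEntropy(interested, reducing);
    }

    public static void main(String[] args) {
        // Sample data copied from the question.
        List<String> survived = Arrays.asList("1","0","1","1","0","1","0","0","0","1","0","1","0","0","1");
        List<String> sex = Arrays.asList("0","1","0","1","1","0","0","1","1","0","1","0","0","1","1");
        System.out.println("H(survived)       = " + entropy(survived));
        System.out.println("H(survived, sex)  = " + jointEntropy(survived, sex));
        System.out.println("H(survived | sex) = " + conditionalEntropy(survived, sex));
        System.out.println("IG(survived; sex) = " + informationGain(survived, sex));
    }
}
```

On the survived/sex sample from the question, counting pairs this way gives a joint entropy of about 1.83 bits instead of 1.99, and an information gain of about 0.16 bits instead of a value indistinguishable from zero. If that matches a hand calculation based on pair counts, the same change may also explain the symmetrical uncertainty results, since that function is built on the information gain.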
