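Does my code correctly compute the entropy / conditional entropy of a dataset?

I am writing a Java program that should compute entropy, joint entropy, conditional entropy, and so on for a given dataset. The relevant class is shown below: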

```java
public class Entropy {

    private Frequency<String> iFrequency = new Frequency<String>();
    private Frequency<String> rFrequency = new Frequency<String>();

    Entropy() {
        super();
    }

    public void setInterestedFrequency(List<String> interestedFrequency) {
        for (String s : interestedFrequency) {
            this.iFrequency.addValue(s);
        }
    }

    public void setReducingFrequency(List<String> reducingFrequency) {
        for (String s : reducingFrequency) {
            this.rFrequency.addValue(s);
        }
    }

    private double log(double num, int base) {
        return Math.log(num) / Math.log(base);
    }

    public double entropy(List<String> data) {
        double entropy = 0.0;
        double prob = 0.0;
        Frequency<String> frequency = new Frequency<String>();
        for (String s : data) {
            frequency.addValue(s);
        }
        String[] keys = frequency.getKeys();
        for (int i = 0; i < keys.length; i++) {
            prob = frequency.getPct(keys[i]);
            entropy = entropy - prob * log(prob, 2);
        }
        return entropy;
    }

    /**
     * return conditional probability of P(interestedClass|reducingClass)
     */
    public double conditionalProbability(List<String> interestedSet,
                                         List<String> reducingSet,
                                         String interestedClass,
                                         String reducingClass) {
        List<Integer> conditionalData = new LinkedList<Integer>();
        if (iFrequency.getKeys().length == 0) {
            this.setInterestedFrequency(interestedSet);
        }
        if (rFrequency.getKeys().length == 0) {
            this.setReducingFrequency(reducingSet);
        }
        for (int i = 0; i < reducingSet.size(); i++) {
            if (reducingSet.get(i).equalsIgnoreCase(reducingClass)) {
                if (interestedSet.get(i).equalsIgnoreCase(interestedClass)) {
                    conditionalData.add(i);
                }
            }
        }
        int numerator = conditionalData.size();
        int denominator = this.rFrequency.getNum(reducingClass);
        return (double) numerator / denominator;
    }

    public double jointEntropy(List<String> set1, List<String> set2) {
        String[] set1Keys;
        String[] set2Keys;
        Double prob1;
        Double prob2;
        Double entropy = 0.0;
        if (this.iFrequency.getKeys().length == 0) {
            this.setInterestedFrequency(set1);
        }
        if (this.rFrequency.getKeys().length == 0) {
            this.setReducingFrequency(set2);
        }
        set1Keys = this.iFrequency.getKeys();
        set2Keys = this.rFrequency.getKeys();
        for (int i = 0; i < set1Keys.length; i++) {
            for (int j = 0; j < set2Keys.length; j++) {
                prob1 = iFrequency.getPct(set1Keys[i]);
                prob2 = rFrequency.getPct(set2Keys[j]);
                entropy = entropy - (prob1 * prob2) * log((prob1 * prob2), 2);
            }
        }
        return entropy;
    }

    public double conditionalEntropy(List<String> interestedSet, List<String> reducingSet) {
        double jointEntropy = jointEntropy(interestedSet, reducingSet);
        double reducingEntropyX = entropy(reducingSet);
        double conEntYgivenX = jointEntropy - reducingEntropyX;
        return conEntYgivenX;
    }
}
```

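For the past few days I have been trying to work out why my entropy calculation almost always comes out exactly the same as my conditional entropy calculation.

I used the following formulas: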

H(X) = -Σ_{x=1..n} p(x) * log(p(x))

H(XY) = -Σ_{x=1..n, y=1..m} (p(x) * p(y)) * log(p(x) * p(y))

H(X|Y) = H(XY) - H(X)

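The values I get for entropy and conditional entropy are almost identical.

With the dataset I use for testing, I get the following values: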
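One way to see why the two numbers coincide, assuming the joint entropy really is computed from the product p(x)p(y) as in the H(XY) formula above: because each set of marginal probabilities sums to 1, the double sum separates into the two marginal entropies.

```latex
\begin{aligned}
-\sum_{x}\sum_{y} p(x)\,p(y)\,\log\bigl(p(x)\,p(y)\bigr)
  &= -\sum_{x}\sum_{y} p(x)\,p(y)\,\bigl[\log p(x) + \log p(y)\bigr] \\
  &= -\Bigl(\sum_{y} p(y)\Bigr)\sum_{x} p(x)\log p(x)
     \;-\; \Bigl(\sum_{x} p(x)\Bigr)\sum_{y} p(y)\log p(y) \\
  &= H(X) + H(Y)
\end{aligned}
```

Subtracting the entropy of either variable from this quantity simply returns the entropy of the other one, which would explain the 2.48 and 2.47 in the unit tests. The textbook joint entropy uses the joint probability p(x, y) instead, estimated from how often each pair of values occurs in the same row; the two quantities agree only when X and Y are independent.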

```java
@Test
public void testEntropy() {
    FileHelper fileHelper = new FileHelper();
    List<String> lines = fileHelper.readFileToMemory("");
    Data freshData = fileHelper.parseCSVData(lines);
    LinkedList<String> headersToChange = new LinkedList<String>();
    headersToChange.add("lwt");
    Data discreteData = freshData.discretize(freshData.getData(), headersToChange, 1, 10);
    Entropy entropy = new Entropy();
    Double result = entropy.entropy(discreteData.getData().get("lwt"));
    assertEquals(2.48, result, .006);
}

@Test
public void testConditionalProbability() {
    FileHelper fileHelper = new FileHelper();
    List<String> lines = fileHelper.readFileToMemory("");
    Data freshData = fileHelper.parseCSVData(lines);
    LinkedList<String> headersToChange = new LinkedList<String>();
    headersToChange.add("age");
    headersToChange.add("lwt");
    Data discreteData = freshData.discretize(freshData.getData(), headersToChange, 1, 10);
    Entropy entropy = new Entropy();
    double conditionalProb = entropy.conditionalProbability(
            discreteData.getData().get("lwt"),
            discreteData.getData().get("age"),
            "4", "6");
    assertEquals(.1, conditionalProb, .005);
}

@Test
public void testJointEntropy() {
    FileHelper fileHelper = new FileHelper();
    List<String> lines = fileHelper.readFileToMemory("");
    Data freshData = fileHelper.parseCSVData(lines);
    LinkedList<String> headersToChange = new LinkedList<String>();
    headersToChange.add("age");
    headersToChange.add("lwt");
    Data discreteData = freshData.discretize(freshData.getData(), headersToChange, 1, 10);
    Entropy entropy = new Entropy();
    double jointEntropy = entropy.jointEntropy(
            discreteData.getData().get("lwt"),
            discreteData.getData().get("age"));
    assertEquals(5.05, jointEntropy, .006);
}

@Test
public void testSpecifiedConditionalEntropy() {
    FileHelper fileHelper = new FileHelper();
    List<String> lines = fileHelper.readFileToMemory("");
    Data freshData = fileHelper.parseCSVData(lines);
    LinkedList<String> headersToChange = new LinkedList<String>();
    headersToChange.add("age");
    headersToChange.add("lwt");
    Data discreteData = freshData.discretize(freshData.getData(), headersToChange, 1, 10);
    Entropy entropy = new Entropy();
    double specCondiEntropy = entropy.specifiedConditionalEntropy(
            discreteData.getData().get("lwt"),
            discreteData.getData().get("age"),
            "4", "6");
    assertEquals(.332, specCondiEntropy, .005);
}

@Test
public void testConditionalEntropy() {
    FileHelper fileHelper = new FileHelper();
    List<String> lines = fileHelper.readFileToMemory("");
    Data freshData = fileHelper.parseCSVData(lines);
    LinkedList<String> headersToChange = new LinkedList<String>();
    headersToChange.add("age");
    headersToChange.add("lwt");
    Data discreteData = freshData.discretize(freshData.getData(), headersToChange, 1, 10);
    Entropy entropy = new Entropy();
    Double result = entropy.conditionalEntropy(
            discreteData.getData().get("lwt"),
            discreteData.getData().get("age"));
    assertEquals(2.47, result, .006);
}
```

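Everything compiles fine, but I am almost certain that my conditional entropy calculation is wrong, and I cannot figure out what mistake I made.

The values in the unit tests are the ones I currently get. They match the output of the functions above.

At one point I also tested with the following: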

```java
List<String> survived = Arrays.asList("1","0","1","1","0","1","0","0","0","1","0","1","0","0","1");
List<String> sex = Arrays.asList("0","1","0","1","1","0","0","1","1","0","1","0","0","1","1");
```

where male = 1 and survived = 1. I then used it to compute

```java
double result = entropy.entropy(survived);
assertEquals(.996, result, .006);
```

and

```java
double jointEntropy = entropy.jointEntropy(survived, sex);
assertEquals(1.99, jointEntropy, .006);
```

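I also checked my work by hand; you can see a picture here: link. Since my code gives the same values I got by hand, and since the other functions are very simple and only use the entropy/joint entropy functions, I assumed everything was fine.

However, something is definitely wrong. Below are two additional functions I wrote to compute the information gain and the symmetrical uncertainty of a set.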
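For comparison, here is a minimal sketch that estimates the joint entropy from the observed (survived, sex) pairs rather than from the product of the two marginal probabilities. The class `JointEntropySketch` and its method names are hypothetical and do not use the `Frequency` class from the question; the sketch assumes both lists have the same length and that row i of one list belongs with row i of the other.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of the Entropy class in the question.
class JointEntropySketch {

    // Shannon entropy in bits of an empirical distribution given as raw counts.
    static double entropyOfCounts(Map<?, Integer> counts, int total) {
        double h = 0.0;
        for (int c : counts.values()) {
            double p = (double) c / total;
            h -= p * (Math.log(p) / Math.log(2));
        }
        return h;
    }

    // H(X): count how often each value occurs on its own.
    static double entropy(List<String> xs) {
        Map<String, Integer> counts = new HashMap<>();
        for (String x : xs) {
            counts.merge(x, 1, Integer::sum);
        }
        return entropyOfCounts(counts, xs.size());
    }

    // H(X,Y): count how often each pair (x_i, y_i) occurs in the same row.
    static double jointEntropy(List<String> xs, List<String> ys) {
        if (xs.size() != ys.size()) {
            throw new IllegalArgumentException("lists must have the same length");
        }
        Map<List<String>, Integer> counts = new HashMap<>();
        for (int i = 0; i < xs.size(); i++) {
            counts.merge(Arrays.asList(xs.get(i), ys.get(i)), 1, Integer::sum);
        }
        return entropyOfCounts(counts, xs.size());
    }

    // H(X|Y) = H(X,Y) - H(Y): subtract the entropy of the conditioning variable.
    static double conditionalEntropy(List<String> xs, List<String> ys) {
        return jointEntropy(xs, ys) - entropy(ys);
    }

    public static void main(String[] args) {
        List<String> survived = Arrays.asList(
                "1", "0", "1", "1", "0", "1", "0", "0", "0", "1", "0", "1", "0", "0", "1");
        List<String> sex = Arrays.asList(
                "0", "1", "0", "1", "1", "0", "0", "1", "1", "0", "1", "0", "0", "1", "1");
        System.out.printf("H(survived)       = %.4f%n", entropy(survived));
        System.out.printf("H(survived, sex)  = %.4f%n", jointEntropy(survived, sex));
        System.out.printf("H(survived | sex) = %.4f%n", conditionalEntropy(survived, sex));
    }
}
```

On the 15 rows above the four pair counts are 5, 6, 2 and 2, so this estimate of the joint entropy comes out near 1.83 bits rather than 1.99, and H(survived | sex) comes out near 0.84 bits.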

```java
public double informationGain(List<String> interestedSet, List<String> reducingSet) {
    double entropy = entropy(interestedSet);
    double conditionalEntropy = conditionalEntropy(interestedSet, reducingSet);
    double infoGain = entropy - conditionalEntropy;
    return infoGain;
}

public double symmetricalUncertainty(List<String> interestedSet, List<String> reducingSet) {
    double infoGain = informationGain(interestedSet, reducingSet);
    double intSet = entropy(interestedSet);
    double redSet = entropy(reducingSet);
    double symUnc = 2 * (infoGain / (intSet + redSet));
    return symUnc;
}
```

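The survived/sex dataset I used initially gave me a slightly negative answer. But since it was only negative by 0.000000000000002, I assumed it was just a rounding error. When I tried running my program, the symmetrical uncertainty values I got made no sense at all.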
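Information gain is the mutual information I(X;Y) = sum over observed pairs of p(x,y) * log2(p(x,y) / (p(x) * p(y))), so it can be computed directly from pair counts without going through the conditional entropy, and it is never negative apart from floating-point noise. The sketch below is one way to do that. `SymmetricalUncertaintySketch` and its helpers are hypothetical names, and returning 0 when both entropies are zero is a convention assumed here, not something taken from the question.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of the Entropy class in the question.
class SymmetricalUncertaintySketch {

    static double log2(double v) {
        return Math.log(v) / Math.log(2);
    }

    // Count how often each value occurs in a single column.
    static Map<String, Integer> countValues(List<String> xs) {
        Map<String, Integer> counts = new HashMap<>();
        for (String x : xs) {
            counts.merge(x, 1, Integer::sum);
        }
        return counts;
    }

    // H(X) in bits from the empirical value frequencies.
    static double entropy(List<String> xs) {
        double h = 0.0;
        for (int c : countValues(xs).values()) {
            double p = c / (double) xs.size();
            h -= p * log2(p);
        }
        return h;
    }

    // I(X;Y) = sum over observed pairs of p(x,y) * log2( p(x,y) / (p(x) * p(y)) ).
    static double mutualInformation(List<String> xs, List<String> ys) {
        if (xs.size() != ys.size()) {
            throw new IllegalArgumentException("lists must have the same length");
        }
        int n = xs.size();
        Map<String, Integer> countX = countValues(xs);
        Map<String, Integer> countY = countValues(ys);
        Map<List<String>, Integer> countXY = new HashMap<>();
        for (int i = 0; i < n; i++) {
            countXY.merge(Arrays.asList(xs.get(i), ys.get(i)), 1, Integer::sum);
        }
        double mi = 0.0;
        for (Map.Entry<List<String>, Integer> e : countXY.entrySet()) {
            double pxy = e.getValue() / (double) n;
            double px = countX.get(e.getKey().get(0)) / (double) n;
            double py = countY.get(e.getKey().get(1)) / (double) n;
            mi += pxy * log2(pxy / (px * py));
        }
        return mi;
    }

    // SU(X,Y) = 2 * I(X;Y) / (H(X) + H(Y)).
    // Assumed convention: return 0 when both columns are constant.
    static double symmetricalUncertainty(List<String> xs, List<String> ys) {
        double denominator = entropy(xs) + entropy(ys);
        if (denominator == 0.0) {
            return 0.0;
        }
        return 2.0 * mutualInformation(xs, ys) / denominator;
    }

    public static void main(String[] args) {
        List<String> survived = Arrays.asList(
                "1", "0", "1", "1", "0", "1", "0", "0", "0", "1", "0", "1", "0", "0", "1");
        List<String> sex = Arrays.asList(
                "0", "1", "0", "1", "1", "0", "0", "1", "1", "0", "1", "0", "0", "1", "1");
        System.out.printf("I(survived; sex)  = %.4f%n", mutualInformation(survived, sex));
        System.out.printf("SU(survived, sex) = %.4f%n", symmetricalUncertainty(survived, sex));
    }
}
```

With this definition the result lies between 0 (independent columns) and 1 (each column fully determines the other), which gives a quick sanity check for any value the program reports.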


Answer:
