### Data structure confusion in a Java perceptron implementation

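I'm trying to implement the perceptron algorithm in Java: just the single-layer kind, not a full neural network. The task I'm trying to solve is a classification problem.

I need to create a bag-of-words feature vector for every document in each of four categories: politics, science, sports, and atheism. Here is the data.

This is what I'm trying to achieve (a direct quote from the first answer to this question):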

Example:

    Document 1 = ["I", "am", "awesome"]
    Document 2 = ["I", "am", "great", "great"]

The dictionary is:

    ["I", "am", "awesome", "great"]

So the documents, as vectors, look like this:

    Document 1 = [1, 1, 1, 0]
    Document 2 = [1, 1, 0, 2]

With this, you can do all sorts of fancy math and feed the result into your perceptron.
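
As a minimal sketch of the quoted example, each document can be turned into a count vector by looking up every token's position in an ordered dictionary. The class and method names (`BagOfWordsSketch`, `buildDictionary`, `vectorize`) are illustrative and not part of the original post, and first-seen order is only one possible choice of ordering.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper class, for illustration only.
class BagOfWordsSketch {

    // Builds an ordered dictionary: each distinct word, in first-seen order.
    static List<String> buildDictionary(List<String[]> documents) {
        Set<String> seen = new LinkedHashSet<>();
        for (String[] doc : documents) {
            seen.addAll(Arrays.asList(doc));
        }
        return new ArrayList<>(seen);
    }

    // Counts how often each dictionary word occurs in one document.
    static int[] vectorize(String[] document, List<String> dictionary) {
        int[] vector = new int[dictionary.size()];
        for (String word : document) {
            int index = dictionary.indexOf(word);
            if (index >= 0) { // skip words that are not in the dictionary
                vector[index]++;
            }
        }
        return vector;
    }

    public static void main(String[] args) {
        String[] doc1 = {"I", "am", "awesome"};
        String[] doc2 = {"I", "am", "great", "great"};
        List<String> dictionary = buildDictionary(Arrays.asList(doc1, doc2));

        System.out.println(dictionary);                                   // [I, am, awesome, great]
        System.out.println(Arrays.toString(vectorize(doc1, dictionary))); // [1, 1, 1, 0]
        System.out.println(Arrays.toString(vectorize(doc2, dictionary))); // [1, 1, 0, 2]
    }
}
```

`indexOf` scans the list on every lookup, which is fine for a toy example but slow for a large vocabulary.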

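I've been able to generate the global dictionary, and now I need to create one for each document, but how do I keep them all organized? The folder structure is very simple: for example, `/politics/` contains many articles, and for each article I need to create a feature vector against the global dictionary. I think the iterator I'm using is what's confusing me.

This is the main class: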

    import java.io.File;
    import java.io.IOException;
    import java.util.HashSet;
    import java.util.Set;

    public class BagOfWords {
        static Set<String> global_dict = new HashSet<String>();
        static boolean global_dict_complete = false;
        static String path = "/home/Workbench/SUTD/ISTD_50.570/assignments/data/train";

        public static void main(String[] args) throws IOException {
            // Each distinct category
            String[] categories = { "/atheism", "/politics", "/science", "/sports" };

            // First pass: loop over all categories once to populate the global dictionary
            for (int cycle = 0; cycle < categories.length; cycle++) {
                File file = new File(path + categories[cycle]);
                Iterateur.iterateDirectory(file, global_dict, global_dict_complete);
            }

            // The global dictionary is fully populated at this point, so mark it
            // complete before the second pass rather than only on the last category.
            global_dict_complete = true;

            // Second pass: loop again to build a set of words for each document
            // and compare it against the global dictionary.
            for (int cycle = 0; cycle < categories.length; cycle++) {
                File file = new File(path + categories[cycle]);
                Iterateur.iterateDirectory(file, global_dict, global_dict_complete);
            }

            // Print the data structure
            // for (String s : global_dict)
            //     System.out.println(s);
        }
    }

This iterates over the data structure:

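    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.Set;

    public class Iterateur {
        static void iterateDirectory(File file,
                                     Set<String> global_dict,
                                     boolean global_dict_complete) throws IOException {
            File[] entries = file.listFiles();
            if (entries == null) {
                return; // not a directory, or it could not be read
            }
            for (File f : entries) {
                if (f.isDirectory()) {
                    // Recurse into the subdirectory f; recursing on file would never terminate
                    iterateDirectory(f, global_dict, global_dict_complete);
                } else {
                    // try-with-resources closes the reader once the file has been processed
                    try (BufferedReader br = new BufferedReader(new FileReader(f))) {
                        String line;
                        while ((line = br.readLine()) != null) {
                            if (!global_dict_complete) {
                                Dictionary.populate_dict(file, f, line, br, global_dict);
                            } else {
                                FeatureVecteur.generateFeatureVecteur(file, f, line, br, global_dict);
                            }
                        }
                    }
                }
            }
        }
    }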

This populates the global dictionary:

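    import java.io.BufferedReader;
    import java.io.File;
    import java.io.IOException;
    import java.util.Set;

    public class Dictionary {
        public static void populate_dict(File file,
                                         File f,
                                         String line,
                                         BufferedReader br,
                                         Set<String> global_dict) throws IOException {
            // Process the line handed in by the caller first, then the rest of the
            // file; a plain while loop here would silently skip that first line.
            do {
                String[] words = line.split(" "); // these are your words
                for (String word : words) {
                    global_dict.add(word); // a Set ignores duplicates, so no contains() check is needed
                }
            } while ((line = br.readLine()) != null);
        }
    }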

This is a first attempt at populating the document-specific dictionary:

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.IOException;
    import java.util.HashSet;
    import java.util.Set;

    public class FeatureVecteur {
        public static void generateFeatureVecteur(File file,
                                                  File f,
                                                  String line,
                                                  BufferedReader br,
                                                  Set<String> global_dict) throws IOException {
            Set<String> file_dict = new HashSet<String>();
            // Start with the line handed in by the caller so it is not skipped,
            // then keep reading until the end of the file.
            do {
                String[] words = line.split(" "); // these are your words
                for (String word : words) {
                    file_dict.add(word); // a Set ignores duplicates
                }
            } while ((line = br.readLine()) != null);
        }
    }

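Answer:

If I understand your question, you want to count how many times each word in the global dictionary appears in a given file. I suggest creating an integer array in which each index corresponds to an index in the global dictionary and each value is the number of times that word occurs in the file.

Then, for each word in the global dictionary, count how many times it appears in the file. You need to be careful, though: a feature vector requires a consistent ordering of its elements, and a HashSet does not guarantee one. In your example, for instance, "I" always needs to be the first element. To solve this, you may want to convert the global dictionary to an ArrayList or some other ordered list once it is completely populated.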

    ArrayList<String> global_dict_list = new ArrayList<String>(global_dict);
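
Copying the set into a list fixes whatever order the set happened to iterate in. If the same ordering also has to be reproducible across separate runs (say, one for training and one for testing), sorting the list is one option, and a word-to-position map avoids scanning the list for every lookup. The sketch below assumes alphabetical order is acceptable; `DictionaryIndexSketch`, `toSortedList`, and `buildWordIndex` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, for illustration only.
class DictionaryIndexSketch {

    // Copies the set into a list and sorts it, so the order is reproducible.
    static List<String> toSortedList(Set<String> globalDict) {
        List<String> ordered = new ArrayList<>(globalDict);
        Collections.sort(ordered);
        return ordered;
    }

    // Maps each word to its position in the ordered list.
    static Map<String, Integer> buildWordIndex(List<String> ordered) {
        Map<String, Integer> wordIndex = new HashMap<>();
        for (int i = 0; i < ordered.size(); i++) {
            wordIndex.put(ordered.get(i), i);
        }
        return wordIndex;
    }

    public static void main(String[] args) {
        Set<String> globalDict = new HashSet<>(Arrays.asList("great", "am", "I", "awesome"));
        List<String> ordered = toSortedList(globalDict);
        Map<String, Integer> wordIndex = buildWordIndex(ordered);

        System.out.println(ordered);                // [I, am, awesome, great]
        System.out.println(wordIndex.get("great")); // 3
    }
}
```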

The counting might look something like this:

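    int[] wordFrequency = new int[global_dict_list.size()];
    for (int g = 0; g < global_dict_list.size(); g++) {
        String globalWord = global_dict_list.get(g);
        for (int i = 0; i < words.length; i++) {
            if (words[i].equals(globalWord)) {
                // Index by the word's position in the dictionary,
                // not by its position in the current line.
                wordFrequency[g]++;
            }
        }
    }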

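Nest the counting loops inside the loop that reads the file line by line in your feature-vector code, but create `wordFrequency` once per file, before that loop, so the counts accumulate across lines. Hope this helps!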
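
Putting the pieces together, here is one possible sketch of how the per-document vectors could be kept organized: one labeled vector per article, collected per category folder. It assumes a `Map<String, Integer>` from each dictionary word to its vector position (0 to size-1) has already been built. All names here (`FeatureVectorSketch`, `LabeledVector`, `countWords`, `vectorizeCategory`) are hypothetical, and the map lookup stands in for the nested loops above.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical names throughout; a sketch under the assumptions stated above.
class FeatureVectorSketch {

    // One training example: the category label plus the word counts.
    static class LabeledVector {
        final String label;
        final int[] counts;

        LabeledVector(String label, int[] counts) {
            this.label = label;
            this.counts = counts;
        }
    }

    // Counts dictionary words in a single file.
    static int[] countWords(File document, Map<String, Integer> wordIndex) throws IOException {
        int[] wordFrequency = new int[wordIndex.size()]; // created once per file
        try (BufferedReader br = new BufferedReader(new FileReader(document))) {
            String line;
            while ((line = br.readLine()) != null) {
                for (String word : line.split(" ")) {
                    Integer position = wordIndex.get(word);
                    if (position != null) { // ignore words outside the dictionary
                        wordFrequency[position]++;
                    }
                }
            }
        }
        return wordFrequency;
    }

    // Builds one labeled vector per regular file directly inside categoryDir.
    static List<LabeledVector> vectorizeCategory(File categoryDir, String label,
                                                 Map<String, Integer> wordIndex) throws IOException {
        List<LabeledVector> examples = new ArrayList<>();
        File[] files = categoryDir.listFiles();
        if (files == null) {
            return examples; // not a directory, or it could not be read
        }
        for (File f : files) {
            if (f.isFile()) {
                examples.add(new LabeledVector(label, countWords(f, wordIndex)));
            }
        }
        return examples;
    }
}
```

Calling `vectorizeCategory` once per category folder (for example with `new File(path + "/politics")` and the label `"politics"`) and concatenating the resulting lists would give a single training set in which every vector shares the same element order.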
