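I want to classify some data into different categories based on its content. I used a naive Bayes classifier for this, and it outputs the best category. Now I also want news that falls outside the training set to be assigned to an "others" class. The number of other categories is huge, so I cannot manually add every item outside the training data to some category. Is there a way to classify this other data?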
import java.io.File;
import java.io.IOException;

import com.aliasi.classify.Classification;
import com.aliasi.classify.Classified;
import com.aliasi.classify.DynamicLMClassifier;
import com.aliasi.lm.NGramProcessLM;
import com.aliasi.util.Files;

// The class name is a placeholder; the original snippet did not show it.
public class TrainClassifier {

    private static File TRAINING_DIR = new File("4news-train");
    private static File TESTING_DIR = new File("4news-test");
    private static String[] CATEGORIES = { "c1", "c2", "c3", "others" };
    private static int NGRAM_SIZE = 6;

    public static void main(String[] args) throws ClassNotFoundException, IOException {
        DynamicLMClassifier<NGramProcessLM> classifier =
            DynamicLMClassifier.createNGramProcess(CATEGORIES, NGRAM_SIZE);

        // Train on every file found in the directory of each category.
        for (int i = 0; i < CATEGORIES.length; ++i) {
            File classDir = new File(TRAINING_DIR, CATEGORIES[i]);
            if (!classDir.isDirectory()) {
                String msg = "Could not find training directory=" + classDir
                    + "\nTraining directory not found";
                System.out.println(msg); // in case exception gets lost in shell
                throw new IllegalArgumentException(msg);
            }
            String[] trainingFiles = classDir.list();
            for (int j = 0; j < trainingFiles.length; ++j) {
                File file = new File(classDir, trainingFiles[j]);
                String text = Files.readFromFile(file, "ISO-8859-1");
                System.out.println("Training on " + CATEGORIES[i] + "/" + trainingFiles[j]);
                Classification classification = new Classification(CATEGORIES[i]);
                Classified<CharSequence> classified =
                    new Classified<CharSequence>(text, classification);
                classifier.handle(classified);
            }
        }
    }
}
Answer:
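Just serialize the object. That means writing the intermediate (trained) object to a file, and that file is your model.
Then, at test time, just pass the data to the model. You don't need to train every time. This should be fairly easy for you.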
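The train-once, save, reload workflow described above could be sketched roughly as follows. This is only an illustration using plain Java serialization: `ToyModel` and `ModelStore` are hypothetical stand-ins, not LingPipe classes, and the keyword lookup is a placeholder for a real trained classifier. For the classifier in the question, LingPipe's own `AbstractExternalizable` utilities are probably the better fit, so check its documentation before relying on this pattern.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a trained classifier (not a LingPipe class).
class ToyModel implements Serializable {
    private static final long serialVersionUID = 1L;

    // keyword -> category pairs "learned" during training
    private final Map<String, String> keywordToCategory = new LinkedHashMap<>();

    void train(String keyword, String category) {
        keywordToCategory.put(keyword, category);
    }

    // Returns the category of the first keyword found in the text.
    // The "unknown" fallback is a toy placeholder only.
    String classify(String text) {
        for (Map.Entry<String, String> e : keywordToCategory.entrySet()) {
            if (text.contains(e.getKey())) {
                return e.getValue();
            }
        }
        return "unknown";
    }
}

// Hypothetical helper that writes a model to disk and reads it back.
class ModelStore {
    static void save(ToyModel model, File file) throws IOException {
        try (ObjectOutputStream out =
                 new ObjectOutputStream(new FileOutputStream(file))) {
            out.writeObject(model);
        }
    }

    static ToyModel load(File file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in =
                 new ObjectInputStream(new FileInputStream(file))) {
            return (ToyModel) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        File modelFile = new File("toy-model.ser"); // assumed file name

        // Training phase: run once, then persist the result.
        ToyModel trained = new ToyModel();
        trained.train("election", "c1");
        trained.train("football", "c2");
        save(trained, modelFile);

        // Testing phase: reload the saved model, no retraining needed.
        ToyModel reloaded = load(modelFile);
        System.out.println(reloaded.classify("football scores tonight"));
    }
}
```

With this split, the training program and the testing program can be separate: only the training program needs access to the training directories, and the testing program only needs the saved model file.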