I currently use the following code to train a classifier model:

    final String iterations = "1000";
    final String cutoff = "0";

    InputStreamFactory dataIn = new MarkableFileInputStreamFactory(
            new File("src/main/resources/trainingSets/classifierA.txt"));
    ObjectStream<String> lineStream = new PlainTextByLineStream(dataIn, "UTF-8");
    ObjectStream<DocumentSample> sampleStream = new DocumentSampleStream(lineStream);

    TrainingParameters params = new TrainingParameters();
    params.put(TrainingParameters.ITERATIONS_PARAM, iterations);
    params.put(TrainingParameters.CUTOFF_PARAM, cutoff);
    params.put(AbstractTrainer.ALGORITHM_PARAM, NaiveBayesTrainer.NAIVE_BAYES_VALUE);

    DoccatModel model = DocumentCategorizerME.train("NL", sampleStream, params, new DoccatFactory());

    OutputStream modelOut = new BufferedOutputStream(
            new FileOutputStream("src/main/resources/models/model.bin"));
    model.serialize(modelOut);
    return model;

It runs fine, and after every run I get the following output:

    Indexing events with TwoPass using cutoff of 0
        Computing event counts... done. 1474 events
        Indexing... done.
    Collecting events... Done indexing in 0,03 s.
    Incorporating indexed data for training...
    done.
        Number of Event Tokens: 1474
        Number of Outcomes: 2
        Number of Predicates: 4149
    Computing model parameters...
    Stats: (998/1474) 0.6770691994572592
    ...done.

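Can someone explain what this output means? Does it say anything about the accuracy?
Answer:
Looking at the source, we can see that this output is produced by the NaiveBayesTrainer::trainModel method: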

    public AbstractModel trainModel(DataIndexer di) {
        // ...
        display("done.\n");
        display("\tNumber of Event Tokens: " + numUniqueEvents + "\n");
        display("\t Number of Outcomes: " + numOutcomes + "\n");
        display("\t Number of Predicates: " + numPreds + "\n");
        display("Computing model parameters...\n");
        MutableContext[] finalParameters = findParameters();
        display("...done.\n");
        // ...
    }

If you look at the code of findParameters(), you will notice that it calls the trainingStats() method, which contains the snippet that computes the accuracy:

    private double trainingStats(EvalParameters evalParams) {
        // ...
        double trainingAccuracy = (double) numCorrect / numEvents;
        display("Stats: (" + numCorrect + "/" + numEvents + ") " + trainingAccuracy + "\n");
        return trainingAccuracy;
    }

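TL;DR: the Stats: (998/1474) 0.6770691994572592 part of the output is the accuracy you are looking for. It means 998 of the 1474 training events were classified correctly, about 67.7%. As the variable name trainingAccuracy suggests, this is measured on the training data itself, not on held-out data.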
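To make the Stats line concrete, here is a minimal standalone sketch of the same arithmetic. It is not OpenNLP code: the class TrainingAccuracySketch and its methods ratio and accuracy are hypothetical names made up for illustration, and the label lists are toy data. It only mirrors the (double) numCorrect / numEvents expression shown in trainingStats().

```java
import java.util.List;

// Hypothetical helper for illustration only; not part of OpenNLP.
final class TrainingAccuracySketch {

    // Same expression as in trainingStats(): correct events divided by all events.
    static double ratio(int numCorrect, int numEvents) {
        if (numEvents <= 0) {
            throw new IllegalArgumentException("numEvents must be positive");
        }
        return (double) numCorrect / numEvents;
    }

    // Counts how many predicted labels match the gold labels, then divides.
    static double accuracy(List<String> gold, List<String> predicted) {
        if (gold.size() != predicted.size()) {
            throw new IllegalArgumentException("label lists must have equal length");
        }
        int numCorrect = 0;
        for (int i = 0; i < gold.size(); i++) {
            if (gold.get(i).equals(predicted.get(i))) {
                numCorrect++;
            }
        }
        return ratio(numCorrect, gold.size());
    }

    public static void main(String[] args) {
        // The numbers from the log line "Stats: (998/1474) ..."
        System.out.println(ratio(998, 1474));

        // Toy data: 3 of 4 predictions match the gold labels.
        List<String> gold = List.of("A", "B", "A", "B");
        List<String> predicted = List.of("A", "B", "B", "B");
        System.out.println(accuracy(gold, predicted)); // prints 0.75
    }
}
```

Because the model is scored on the same events it was trained on, a number like this tends to be optimistic; scoring labeled samples that were kept out of training gives a more honest estimate.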