I am trying to compute the accuracy of my tagger. However, even though I use the same training and development data, I get a different accuracy every time I run the program. What is the reason behind this? Thanks in advance.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier

with open('train.txt') as f:
    training_sentences = list(splitter(f))
with open('develop.txt') as f:
    test_sentences = list(splitter(f))

...SOME FEATURES AS A LIST OF DICTS...

def transform_to_dataset(training_sentences):
    X, y = [], []
    for tagged in training_sentences:
        for index in range(len(tagged)):
            X.append(features(untag(tagged), index))
            y.append(tagged[index][1])
    return X, y

X, y = transform_to_dataset(training_sentences)

clf = Pipeline([
    ('vectorizer', DictVectorizer(sparse=False)),
    ('classifier', DecisionTreeClassifier(criterion='entropy'))
])
clf.fit(X, y)

X_test, y_test = transform_to_dataset(test_sentences)
print "Accuracy:", clf.score(X_test, y_test)
Answer:

sklearn's DecisionTreeClassifier uses a random number generator when deciding its splits. If you want to guarantee the same results on every run, set the classifier's random_state parameter.
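A minimal sketch of the fix (the make_clf helper and the toy features are illustrative, not from the original post): passing a fixed random_state to DecisionTreeClassifier makes repeated fits of the pipeline produce identical trees, and therefore identical scores.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier

def make_clf(seed=0):
    # random_state pins the tree's internal RNG, so every fit is deterministic.
    return Pipeline([
        ('vectorizer', DictVectorizer(sparse=False)),
        ('classifier', DecisionTreeClassifier(criterion='entropy',
                                              random_state=seed)),
    ])

# Tiny illustrative feature dicts and tags, standing in for the real data.
X = [{'word': 'the'}, {'word': 'cat'}, {'word': 'sat'}, {'word': 'the'}]
y = ['DT', 'NN', 'VB', 'DT']

# Two independent runs now agree.
preds_a = make_clf().fit(X, y).predict(X)
preds_b = make_clf().fit(X, y).predict(X)
```

Any fixed integer works for the seed; the point is simply that it must be the same value across runs.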