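I am following this example to build a multinomial Naive Bayes classifier for text data in scikit-learn. However, the confusion matrix and the classifier's F1 score come out wrong. I suspect the error is related to the format of my input data. I have one csv file per training sample. Each csv file contains a single row of features, in the form 'blah, blahblah, andsoon'. Each file is labeled as either positive or negative. How do I read these files correctly?
Here is my code: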
    import os
    import numpy
    import csv
    from pandas import DataFrame
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import Pipeline
    from sklearn.cross_validation import KFold
    from sklearn.metrics import confusion_matrix, f1_score

    NEWLINE = '\n'
    NEGATIVE = 'negative'
    POSITIVE = 'positive'
    SOURCES = [
        ('negative\\', NEGATIVE),
        ('positive\\', POSITIVE)
    ]
    SKIP_FILES = {'cmds'}


    def build_data_frame(policies, path, classification):
        rows = []
        index = []
        for policy in policies:
            current_csv = path + policy + '.csv'
            # Check whether the file exists
            if (os.path.isfile(current_csv)):
                with open(current_csv, 'r') as csvfile:
                    reader = csv.reader(csvfile, delimiter=',', quotechar='"')
                    # Get each row in the policy
                    for row in reader:
                        # Remove all commas from the list of text
                        clean_row = ' '.join(row)
                        rows.append({'text': clean_row, 'class': classification})
                        index.append(current_csv)
        data_frame = DataFrame(rows, index=index)
        return data_frame


    def policy_analyzer_main(policies, write_pol_path):
        data = DataFrame({'text': [], 'class': []})
        for path, classification in SOURCES:
            data = data.append(build_data_frame(policies, write_pol_path + path, classification))
        classify(data)


    pipeline = Pipeline([
        ('count_vectorizer', CountVectorizer()),
        ('classifier', MultinomialNB())
    ])


    def classify(data):
        k_fold = KFold(n=len(data), n_folds=10)
        scores = []
        confusion = numpy.array([[0, 0], [0, 0]])
        for train_indices, test_indices in k_fold:
            train_text = data.iloc[train_indices]['text'].values
            train_y = data.iloc[train_indices]['class'].values.astype(str)
            test_text = data.iloc[test_indices]['text'].values
            test_y = data.iloc[test_indices]['class'].values.astype(str)
            pipeline.fit(train_text, train_y)
            predictions = pipeline.predict(test_text)
            confusion += confusion_matrix(test_y, predictions)
            score = f1_score(test_y, predictions, pos_label=POSITIVE)
            scores.append(score)
        print('Total emails classified:', len(data))
        print('Score:', sum(scores)/len(scores))
        print('Confusion matrix:')
        print(confusion)
Here is an example of the warning message I get:

    UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 due to no predicted samples.
      'precision', 'predicted', average, warn_for)
    ('Total emails classified:', 75)
    ('Score:', 0.025000000000000001)
    Confusion matrix:
    [[39 35]
     [46 24]]
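Answer:
Look at your predictions for each train-test split. That warning means your algorithm labeled every test sample in the fold as negative even though some samples in that test set are positive (perhaps only one is positive, but the warning is raised either way).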
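One way to do that inspection is to tally the true and predicted labels for each fold. The helper below is a minimal sketch in plain Python; `label_counts` is a hypothetical name, and the hard-coded lists stand in for `test_y` and `predictions` from the loop in the question.

```python
from collections import Counter


def label_counts(true_labels, predicted_labels):
    """Return (true_counts, predicted_counts) as plain dicts for one fold."""
    return dict(Counter(true_labels)), dict(Counter(predicted_labels))


# Stand-in data for a single fold: one positive sample, nothing predicted positive.
fold_true = ['negative', 'negative', 'positive', 'negative']
fold_pred = ['negative', 'negative', 'negative', 'negative']

true_counts, pred_counts = label_counts(fold_true, fold_pred)
print('true:', true_counts)
print('predicted:', pred_counts)

# A fold whose predicted counts have no 'positive' entry matches the
# "no predicted samples" situation the warning describes.
if 'positive' not in pred_counts:
    print('no sample in this fold was predicted as positive')
```

Calling something like this inside the cross-validation loop, once per fold, should show which folds are responsible for the warning.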
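Also check how your dataset is being split, because some test splits may contain only a single positive sample that your classifier then misclassifies.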
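To check the split, it can help to count the labels that land in each test fold. The sketch below uses plain Python and rests on two assumptions: the folds are consecutive and unshuffled (which is how `KFold` splits when shuffling is off), and the data is a made-up set of 6 negative rows followed by 4 positive rows, mirroring how the question appends all negatives before all positives. `contiguous_folds` is an illustrative helper, not a scikit-learn function.

```python
from collections import Counter


def contiguous_folds(n_samples, n_folds):
    """Yield (start, stop) index ranges for consecutive, unshuffled folds."""
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        # The first `extra` folds get one additional sample.
        stop = start + base + (1 if fold < extra else 0)
        yield start, stop
        start = stop


# Hypothetical labels: negatives first, then positives.
labels = ['negative'] * 6 + ['positive'] * 4

fold_summaries = []
for start, stop in contiguous_folds(len(labels), 5):
    counts = dict(Counter(labels[start:stop]))
    fold_summaries.append(counts)
    print('test rows {}-{}: {}'.format(start, stop - 1, counts))
```

With this ordering every test fold contains a single class. If your own folds look similarly lopsided, shuffling the rows before splitting or using a stratified splitter such as scikit-learn's `StratifiedKFold` is a common remedy.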
For example, the warning is raised in the following case (to make clear what is happening in your code):

    from sklearn.metrics import f1_score

    # Here we have 4 labels for 4 samples
    f1_score([0,0,1,0],[0,0,0,0])

    /usr/local/lib/python3.4/dist-packages/sklearn/metrics/classification.py:1074: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 due to no predicted samples.
      'precision', 'predicted', average, warn_for)
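The reason the score is ill-defined can be seen by computing precision, recall, and F1 by hand for the same labels. This is a sketch in plain Python, not scikit-learn's implementation; `f1_by_hand` is a hypothetical helper that returns `None` wherever a denominator is zero instead of substituting a number.

```python
def f1_by_hand(y_true, y_pred, pos_label=1):
    """Return (precision, recall, f1); a value is None when its denominator is zero."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == pos_label and p == pos_label)
    fp = sum(1 for t, p in pairs if t != pos_label and p == pos_label)
    fn = sum(1 for t, p in pairs if t == pos_label and p != pos_label)

    # Precision divides by the number of predicted positives (tp + fp).
    precision = tp / (tp + fp) if (tp + fp) else None
    # Recall divides by the number of actual positives (tp + fn).
    recall = tp / (tp + fn) if (tp + fn) else None

    if precision is None or recall is None or (precision + recall) == 0:
        f1 = None
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Same labels as the f1_score call above: nothing is predicted as class 1,
# so tp + fp == 0 and precision has no defined value.
no_positive_predictions = f1_by_hand([0, 0, 1, 0], [0, 0, 0, 0])
print(no_positive_predictions)

# With the positive sample predicted correctly, all three values are defined.
one_correct_positive = f1_by_hand([0, 0, 1, 0], [0, 0, 1, 0])
print(one_correct_positive)
```

In the first call precision is 0/0, which is why scikit-learn reports "no predicted samples" and falls back to 0.0 for the fold.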