How can I use a trained sklearn Gaussian Naive Bayes classifier to predict the label of an email?

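I have built a Gaussian Naive Bayes classifier on an email (spam / not spam) dataset, and it runs successfully. I vectorized the data, split it into training and test sets, and computed the accuracy, using all the functionality that the sklearn Gaussian Naive Bayes classifier provides.

Now I want to use this classifier to predict the "label" of new emails, that is, whether or not they are spam. For example, given a single email, I want to feed it to my classifier and get a prediction of whether it is spam. How can I do that? Please help.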

Code of the classifier file:

    #!/usr/bin/python
    import sys
    from time import time
    import logging

    # Display progress logs on stdout
    logging.basicConfig(level=logging.DEBUG, format='%(asctime)s %(message)s')

    sys.path.append("../DatasetProcessing/")
    from vectorize_split_dataset import preprocess

    ### features_train and features_test are the features for the training
    ### and testing datasets, respectively
    ### labels_train and labels_test are the corresponding item labels
    features_train, features_test, labels_train, labels_test = preprocess()

    #########################################################
    from sklearn.naive_bayes import GaussianNB

    clf = GaussianNB()
    t0 = time()
    clf.fit(features_train, labels_train)
    pred = clf.predict(features_test)
    print("training time:", round(time() - t0, 3), "s")
    print(clf.score(features_test, labels_test))

    ## Printing Metrics for Training and Testing
    print("No. of Testing Features:" + str(len(features_test)))
    print("No. of Testing Features Label:" + str(len(labels_test)))
    print("No. of Training Features:" + str(len(features_train)))
    print("No. of Training Features Label:" + str(len(labels_train)))
    print("No. of Predicted Features:" + str(len(pred)))

    ## Calculating Classifier Performance
    from sklearn.metrics import classification_report

    y_true = labels_test
    y_pred = pred
    labels = ['0', '1']
    target_names = ['class 0', 'class 1']
    print(classification_report(y_true, y_pred, target_names=target_names, labels=labels))

    # How to predict label of a new text
    new_text = "You won a lottery at UK lottery commission. Reply to claim it"

Vectorization code:

    #!/usr/bin/python
    import os
    import pickle
    import numpy

    numpy.random.seed(42)

    path = os.path.dirname(os.path.abspath(__file__))

    ### The words (features) and label_data (labels), already largely processed.
    ### These files should have been created beforehand
    feature_data_file = os.path.join(path, "createdDataset", "dataSet.pkl")
    label_data_file = os.path.join(path, "createdDataset", "dataLabel.pkl")
    feature_data = pickle.load(open(feature_data_file, "rb"))
    label_data = pickle.load(open(label_data_file, "rb"))

    ### test_size is the percentage of events assigned to the test set (the
    ### remainder go into training)
    ### feature matrices changed to dense representations for compatibility with
    ### classifier functions in versions 0.15.2 and earlier
    # sklearn.cross_validation was removed in scikit-learn 0.20;
    # train_test_split now lives in sklearn.model_selection
    from sklearn.model_selection import train_test_split

    features_train, features_test, labels_train, labels_test = train_test_split(
        feature_data, label_data, test_size=0.1, random_state=42)

    from sklearn.feature_extraction.text import TfidfVectorizer

    vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words='english')
    features_train = vectorizer.fit_transform(features_train)
    features_test = vectorizer.transform(features_test)  # .toarray()

    ## feature selection to reduce dimensionality
    from sklearn.feature_selection import SelectPercentile, f_classif

    selector = SelectPercentile(f_classif, percentile=5)
    selector.fit(features_train, labels_train)
    features_train_transformed_reduced = selector.transform(features_train).toarray()
    features_test_transformed_reduced = selector.transform(features_test).toarray()

    features_train = features_train_transformed_reduced
    features_test = features_test_transformed_reduced

    def preprocess():
        return features_train, features_test, labels_train, labels_test

Dataset generation code:

    #!/usr/bin/python
    import os
    import pickle
    import re
    import sys

    # sys.path.append("../tools/")

    """
        Starter code to process the texts of accurate and inaccurate category to extract
        the features and get the documents ready for classification.

        The list of all the texts from accurate category are in the accurate_files list
        likewise for texts of inaccurate category are in (inaccurate_files)

        The data is stored in lists and packed away in pickle files at the end.
    """

    accurate_files = open("./rawDatasetLocation/accurateFiles.txt", "r")
    inaccurate_files = open("./rawDatasetLocation/inaccurateFiles.txt", "r")

    label_data = []
    feature_data = []

    ### temp_counter is a way to speed up the development--there are
    ### thousands of lines of accurate and inaccurate text, so running over all of them
    ### can take a long time
    ### temp_counter helps you only look at the first 200 lines in the list so you
    ### can iterate your modifications quicker
    temp_counter = 0

    for name, from_text in [("accurate", accurate_files), ("inaccurate", inaccurate_files)]:
        for path in from_text:
            ### only look at first 200 texts when developing
            ### once everything is working, remove this line to run over full dataset
            temp_counter = 1
            if temp_counter < 200:
                path = os.path.join('..', path[:-1])
                print(path)
                text = open(path, "r")
                line = text.readline()
                while line:
                    ### use a function parseOutText to extract the text from the opened text
                    # stem_text = parseOutText(text)
                    stem_text = text.readline().strip()
                    print(stem_text)
                    ### use str.replace() to remove any instances of the words
                    # stem_text = stem_text.replace("germani", "")
                    ### append the text to feature_data
                    feature_data.append(stem_text)
                    ### append a 0 to label_data if text is from Sara, and 1 if text is from Chris
                    if (name == "accurate"):
                        label_data.append("0")
                    elif (name == "inaccurate"):
                        label_data.append("1")
                    line = text.readline()
                text.close()

    print("texts processed")
    accurate_files.close()
    inaccurate_files.close()

    pickle.dump(feature_data, open("./createdDataset/dataSet.pkl", "wb"))
    pickle.dump(label_data, open("./createdDataset/dataLabel.pkl", "wb"))

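I would also like to know whether the classifier can be trained incrementally, that is, whether an already created model can be retrained with new data so that it keeps improving over time.

I would be very grateful if someone could help me with this. I am really stuck here.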


Answer:

You are already using your model to predict the labels of the emails in your test set. That is what pred = clf.predict(features_test) does. If you want to see these labels, use print(pred).

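But perhaps you want to know how to predict labels for emails you come across in the future, which are not in your current test set? If so, you can treat your new emails as a new test set. As with your previous test set, the data has to go through several key processing steps:

1) First, you need to generate features for your new email data. The feature generation step is not included in your code above, but it has to happen.

2) You are using a Tfidf vectorizer, which converts a collection of documents into a matrix of Tfidf features based on term frequency and inverse document frequency. You need to pass your new email test feature data through the vectorizer you fitted on the training data (call transform, not fit_transform, so the vocabulary stays the same).

3) Then your new email test feature data needs to go through dimensionality reduction using the same selector you fitted on the training data.

4) Finally, run predict on your new test data. If you want to see the new labels, use print(pred).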
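
The steps above can be sketched roughly as follows. This is a minimal, self-contained illustration and not your actual pipeline: the training texts are a made-up toy dataset standing in for your pickled data, percentile=50 is chosen only because the toy vocabulary is tiny (your code uses 5), and predict_labels is a hypothetical helper name.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.naive_bayes import GaussianNB

# Toy stand-in for the pickled dataset (hypothetical data).
# Labels follow the question's convention: "0" and "1" as strings.
train_texts = [
    "meeting agenda for the project review tomorrow",
    "please send the quarterly report before friday",
    "lunch with the team after the project meeting",
    "notes from the review meeting are attached",
    "you won a lottery prize claim your cash now",
    "claim your free prize reply to win cash",
    "lottery winner reply now to claim free money",
    "win free cash now in our prize lottery",
]
train_labels = ["0", "0", "0", "0", "1", "1", "1", "1"]

# Fit the vectorizer, the selector and the classifier on the training data only.
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words="english")
features_train = vectorizer.fit_transform(train_texts)

selector = SelectPercentile(f_classif, percentile=50)  # assumption: 50 for the toy data
selector.fit(features_train, train_labels)

clf = GaussianNB()
clf.fit(selector.transform(features_train).toarray(), train_labels)


def predict_labels(texts):
    """Push raw texts through the already fitted vectorizer and selector, then predict."""
    features = vectorizer.transform(texts)             # step 2: transform, not fit_transform
    reduced = selector.transform(features).toarray()   # step 3: same fitted selector
    return clf.predict(reduced)                        # step 4: predict


new_text = "You won a lottery at UK lottery commission. Reply to claim it"
pred = predict_labels([new_text])  # note the list: one row per email
print(pred)
```

In your layout, vectorizer and selector are module-level names in vectorize_split_dataset, so the classifier file would need to import them (or preprocess() would need to return them) in order to reuse the fitted objects.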

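As for your last question, about iteratively retraining your model: yes, you can definitely do that. It is just a matter of choosing a frequency, writing a script that extends your dataset with the new data, and then re-running all the steps, from preprocessing through Tfidf vectorization, dimensionality reduction and fitting to prediction.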
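
A rough sketch of such a retraining script is shown below, assuming the labelled texts are kept in plain Python lists. The train function, the toy emails and percentile=50 are illustrative assumptions, not part of your code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.naive_bayes import GaussianNB


def train(texts, labels, percentile=50):
    """Re-run every step from scratch: vectorize, reduce dimensionality, fit."""
    vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words="english")
    features = vectorizer.fit_transform(texts)
    selector = SelectPercentile(f_classif, percentile=percentile)
    reduced = selector.fit_transform(features, labels).toarray()
    clf = GaussianNB().fit(reduced, labels)
    return vectorizer, selector, clf


# Hypothetical initial dataset ("0" = not spam, "1" = spam).
texts = [
    "meeting agenda for the project review",
    "quarterly report attached for the team",
    "lunch with the project team tomorrow",
    "claim your free lottery prize",
    "win cash in the prize lottery",
    "free cash for every lottery winner",
]
labels = ["0", "0", "0", "1", "1", "1"]
old_vectorizer, old_selector, old_clf = train(texts, labels)

# Later: newly labelled emails arrive, so extend the dataset and retrain everything.
texts += ["notes from the review meeting", "reply to claim your cash prize"]
labels += ["0", "1"]
vectorizer, selector, clf = train(texts, labels)

# Words seen only in the new emails enter the vocabulary after the full retrain.
print("reply" in old_vectorizer.vocabulary_, "reply" in vectorizer.vocabulary_)
```

GaussianNB also has a partial_fit method for updating the classifier with new batches. It only updates the classifier, though, so the new data would have to go through the already fitted vectorizer and selector, and words that appear only in the new emails would not become features. Re-running all the steps avoids that limitation.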
