How do I use a trained sklearn Gaussian Naive Bayes classifier to predict the label of an email?

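I have created a Gaussian Naive Bayes classifier on an email (spam/not spam) dataset and ran it successfully. I vectorized the data, split it into training and test sets, and then computed the accuracy, using everything the sklearn Gaussian Naive Bayes classifier offers.

Now I want to use this classifier to predict the "labels" of new emails, that is, whether or not they are spam. For example, say I have an email that I want to feed to my classifier and get back a prediction of whether it is spam. How can I do this? Please help.

Code for the classifier file.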

#!/usr/bin/python
import sys
from time import time
import logging

# Display progress logs on stdout
logging.basicConfig(level = logging.DEBUG, format = '%(asctime)s %(message)s')

sys.path.append("../DatasetProcessing/")
from vectorize_split_dataset import preprocess

### features_train and features_test are the features for the training and testing datasets, respectively
### labels_train and labels_test are the corresponding item labels
features_train, features_test, labels_train, labels_test = preprocess()

#########################################################

from sklearn.naive_bayes import GaussianNB

clf = GaussianNB()
t0 = time()
clf.fit(features_train, labels_train)
pred = clf.predict(features_test)
print("training time:", round(time() - t0, 3), "s")
print(clf.score(features_test, labels_test))

## Printing Metrics for Training and Testing
print("No. of Testing Features:" + str(len(features_test)))
print("No. of Testing Features Label:" + str(len(labels_test)))
print("No. of Training Features:" + str(len(features_train)))
print("No. of Training Features Label:" + str(len(labels_train)))
print("No. of Predicted Features:" + str(len(pred)))

## Calculating Classifier Performance
from sklearn.metrics import classification_report

y_true = labels_test
y_pred = pred
labels = ['0', '1']
target_names = ['class 0', 'class 1']
print(classification_report(y_true, y_pred, target_names = target_names, labels = labels))

# How to predict label of a new text
new_text = "You won a lottery at UK lottery commission. Reply to claim it"

Vectorization code

#!/usr/bin/python
import os
import pickle
import numpy

numpy.random.seed(42)

path = os.path.dirname(os.path.abspath(__file__))

### The words (features) and label_data (labels), already largely processed.
### These files should have been created beforehand
# os.path.join avoids the malformed "<dir>./createdDataset" path that plain
# string concatenation produces
feature_data_file = os.path.join(path, "createdDataset", "dataSet.pkl")
label_data_file = os.path.join(path, "createdDataset", "dataLabel.pkl")
feature_data = pickle.load(open(feature_data_file, "rb"))
label_data = pickle.load(open(label_data_file, "rb"))

### test_size is the percentage of events assigned to the test set (the
### remainder go into training)
### feature matrices changed to dense representations for compatibility with
### classifier functions in versions 0.15.2 and earlier
# sklearn.cross_validation was removed in scikit-learn 0.20;
# train_test_split now lives in sklearn.model_selection
from sklearn.model_selection import train_test_split

features_train, features_test, labels_train, labels_test = train_test_split(
    feature_data, label_data, test_size = 0.1, random_state = 42)

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(sublinear_tf = True, max_df = 0.5, stop_words = 'english')
features_train = vectorizer.fit_transform(features_train)
features_test = vectorizer.transform(features_test) #.toarray()

## feature selection to reduce dimensionality
from sklearn.feature_selection import SelectPercentile, f_classif

selector = SelectPercentile(f_classif, percentile = 5)
selector.fit(features_train, labels_train)
features_train_transformed_reduced = selector.transform(features_train).toarray()
features_test_transformed_reduced = selector.transform(features_test).toarray()
features_train = features_train_transformed_reduced
features_test = features_test_transformed_reduced

def preprocess():
    return features_train, features_test, labels_train, labels_test

Dataset generation code

#!/usr/bin/python
import os
import pickle
import re
import sys

# sys.path.append("../tools/")

"""
    Starter code to process the texts of accurate and inaccurate category to extract
    the features and get the documents ready for classification.
    The list of all the texts from accurate category are in the accurate_files list
    likewise for texts of inaccurate category are in (inaccurate_files)
    The data is stored in lists and packed away in pickle files at the end.
"""

accurate_files = open("./rawDatasetLocation/accurateFiles.txt", "r")
inaccurate_files = open("./rawDatasetLocation/inaccurateFiles.txt", "r")

label_data = []
feature_data = []

### temp_counter is a way to speed up the development--there are
### thousands of lines of accurate and inaccurate text, so running over all of them
### can take a long time
### temp_counter helps you only look at the first 200 lines in the list so you
### can iterate your modifications quicker
temp_counter = 0

for name, from_text in [("accurate", accurate_files), ("inaccurate", inaccurate_files)]:
    for path in from_text:
        ### only look at first 200 texts when developing
        ### once everything is working, remove this line to run over full dataset
        temp_counter = 1
        if temp_counter < 200:
            path = os.path.join('..', path[: -1])
            print(path)
            text = open(path, "r")
            line = text.readline()
            while line:
                ### use a function parseOutText to extract the text from the opened text
                # stem_text = parseOutText(text)
                stem_text = text.readline().strip()
                print(stem_text)
                ### use str.replace() to remove any instances of the words
                # stem_text = stem_text.replace("germani", "")
                ### append the text to feature_data
                feature_data.append(stem_text)
                ### append a 0 to label_data if text is from Sara, and 1 if text is from Chris
                if (name == "accurate"):
                    label_data.append("0")
                elif (name == "inaccurate"):
                    label_data.append("1")
                line = text.readline()
            text.close()

print("texts processed")
accurate_files.close()
inaccurate_files.close()

pickle.dump(feature_data, open("./createdDataset/dataSet.pkl", "wb"))
pickle.dump(label_data, open("./createdDataset/dataLabel.pkl", "wb"))

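I would also like to know whether the classifier can be trained incrementally, that is, whether the model I have already created can be retrained with new data so that it keeps improving over time.

I would be very grateful if someone could help me with this. I am really stuck here.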


Answer:

You are already using your model to predict the labels of the emails in your test set. That is what pred = clf.predict(features_test) does. If you want to see those labels, use print(pred).

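But perhaps you want to know how to predict labels for emails you come across in the future that are not in your current test set? If so, you can treat your new emails as a new test set. As with your previous test set, the data needs to go through a few key processing steps:

1) First, you need to generate features for your new email data. The feature generation step is not included in your code above, but it does need to happen.

2) You are using a Tfidf vectorizer, which converts a collection of documents into a matrix of Tfidf features based on term frequency and inverse document frequency. You need to pass your new email test feature data through the vectorizer that you fit on the training data (call transform, not fit_transform, so the learned vocabulary stays the same).

3) Then your new email test feature data needs to go through dimensionality reduction using the same selector that you fit on the training data.

4) Finally, run predict on your new test data. If you want to see the new labels, use print(pred).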
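The steps above can be sketched roughly as follows. This is a minimal illustration, not the asker's actual pipeline: the tiny training corpus, the `predict_labels` helper, and `percentile=50` (used instead of 5 because the toy vocabulary is so small) are all assumptions made so the example runs on its own. In the real project the fitted `vectorizer` and `selector` would come from the vectorization script.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.naive_bayes import GaussianNB

# Hypothetical toy corpus standing in for the pickled dataset ("1" = spam, "0" = not spam).
train_texts = [
    "win a free lottery prize now",
    "claim your free prize money today",
    "lottery winner reply to claim cash",
    "meeting agenda for monday morning",
    "please review the attached project report",
    "lunch with the team on friday",
]
train_labels = ["1", "1", "1", "0", "0", "0"]

# Fit the vectorizer, selector and classifier on the training data only.
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words="english")
features_train = vectorizer.fit_transform(train_texts)

# percentile=50 is an assumption for this tiny vocabulary; the question uses 5.
selector = SelectPercentile(f_classif, percentile=50)
selector.fit(features_train, train_labels)

clf = GaussianNB()
clf.fit(selector.transform(features_train).toarray(), train_labels)


def predict_labels(texts):
    """Run raw email texts through the already-fitted steps (hypothetical helper)."""
    features = vectorizer.transform(texts)              # step 2: transform, not fit_transform
    reduced = selector.transform(features).toarray()    # step 3: same fitted selector
    return clf.predict(reduced)                         # step 4: predict


new_text = "You won a lottery at UK lottery commission. Reply to claim it"
pred = predict_labels([new_text])  # note the list: the vectorizer expects an iterable of documents
print(pred)
```

Passing the new email as a one-element list matters: handing a bare string to the vectorizer raises an error, because it expects a collection of documents.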

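As for your last question, about iteratively retraining your model: yes, you can definitely do that. It is just a matter of choosing a frequency, writing a script that extends your dataset with the incoming data, and then re-running all the steps, from preprocessing, Tfidf vectorization, and dimensionality reduction through fitting and prediction.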
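A retraining script along those lines might look like the sketch below. The `retrain` function, the sample emails, and `percentile=50` are hypothetical and chosen only to keep the example self-contained. In practice the old and new data would be loaded from wherever the dataset is stored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.naive_bayes import GaussianNB


def retrain(texts, labels, percentile=50):
    """Refit vectorizer, selector and classifier from scratch (hypothetical helper)."""
    vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words="english")
    features = vectorizer.fit_transform(texts)
    selector = SelectPercentile(f_classif, percentile=percentile)
    reduced = selector.fit_transform(features, labels).toarray()
    clf = GaussianNB().fit(reduced, labels)
    return vectorizer, selector, clf


# Hypothetical existing dataset ("1" = spam, "0" = not spam).
old_texts = [
    "win a free lottery prize now",
    "claim your free prize money today",
    "meeting agenda for monday morning",
    "lunch with the team on friday",
]
old_labels = ["1", "1", "0", "0"]

# Hypothetical newly labelled emails collected since the last training run.
new_texts = [
    "urgent invoice payment required click link",
    "project status update for the quarterly review",
]
new_labels = ["1", "0"]

# Extend the dataset, then re-run every step on the combined data.
all_texts = old_texts + new_texts
all_labels = old_labels + new_labels
vectorizer, selector, clf = retrain(all_texts, all_labels)

print(len(all_texts), "training emails,", len(vectorizer.vocabulary_), "vocabulary terms")
```

GaussianNB also has a partial_fit method for updating the classifier alone, but the Tfidf vocabulary and the selected features would then stay frozen at their original values, so this sketch follows the full-refit approach described above.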
