Python Naive Bayes heart disease prediction gives inaccurate results

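I am trying to build a heart disease prediction program using Naive Bayes. After finishing the classifier, cross-validation showed an average accuracy of 80%. However, when I try to predict a given sample, the prediction is completely wrong! The dataset is the heart disease dataset from the UCI repository and contains 303 samples. There are two classes: 0 for healthy and 1 for sick. When I try to predict a sample taken from the dataset itself, the classifier fails to return the true value for all but a very few samples. Here is the code: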

import pandas as pd
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.preprocessing import Imputer, StandardScaler

class Predict:
    def Read_Clean(self,dataset):
        header_row = ['Age', 'Gender', 'Chest_Pain', 'Resting_Blood_Pressure', 'Serum_Cholestrol',
                      'Fasting_Blood_Sugar', 'Resting_ECG', 'Max_Heart_Rate',
                      'Exercise_Induced_Angina', 'OldPeak',
                      'Slope', 'CA', 'Thal', 'Num']
        df = pd.read_csv(dataset, names=header_row)
        df = df.replace('[?]', np.nan, regex=True)
        df = pd.DataFrame(Imputer(missing_values='NaN', strategy='mean', axis=0)
                          .fit_transform(df), columns=header_row)
        df = df.astype(float)
        return df

    def Train_Test_Split_data(self,dataset):
        Y = dataset['Num'].apply(lambda x: 1 if x > 0 else 0)
        X = dataset.drop('Num', axis=1)
        validation_size = 0.20
        seed = 42
        X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=validation_size, random_state=seed)
        return X_train, X_test, Y_train, Y_test

    def Scaler(self, X_train, X_test):
        scaler = StandardScaler()
        X_train = scaler.fit_transform(X_train)
        X_test = scaler.transform(X_test)
        return X_train, X_test

    def Cross_Validate(self, clf, X_train, Y_train, cv=5):
        scores = cross_val_score(clf, X_train, Y_train, cv=cv, scoring='f1')
        score = scores.mean()
        print("CV scores mean: %.4f " % (score))
        return score, scores

    def Fit_Score(self, clf, X_train, Y_train, X_test, Y_test, label='x'):
        clf.fit(X_train, Y_train)
        fit_score = clf.score(X_train, Y_train)
        pred_score = clf.score(X_test, Y_test)
        print("%s: fit score %.5f, predict score %.5f" % (label, fit_score, pred_score))
        return pred_score

    def ReturnPredictionValue(self, clf, sample):
        y = clf.predict([sample])
        return y[0]

    def PredictionMain(self, sample, dataset_path='dataset/processed.cleveland.data'):
        data = self.Read_Clean(dataset_path)
        X_train, X_test, Y_train, Y_test = self.Train_Test_Split_data(data)
        X_train, X_test = self.Scaler(X_train, X_test)
        self.NB = GaussianNB()
        self.Fit_Score(self.NB, X_train, Y_train, X_test, Y_test, label='NB')
        self.Cross_Validate(self.NB, X_train, Y_train, 10)
        return self.ReturnPredictionValue(self.NB, sample)

When I run the following code:

if __name__ == '__main__':
    sample = [41.0, 0.0, 2.0, 130.0, 204.0, 0.0, 2.0, 172.0, 0.0, 1.4, 1.0, 0.0, 3.0]
    p = Predict()
    print "Prediction value: {}".format(p.PredictionMain(sample))

The result is:

NB: fit score 0.84711, predict score 0.83607
CV scores mean: 0.8000

Prediction value: 1

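I get 1 instead of 0 (this sample is already one of the samples in the dataset). I did the same for several samples from the dataset and got the wrong result most of the time, as if the accuracy were not 80% at all!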

Any help would be appreciated. Thanks in advance.


Edit: Using a Pipeline solved the problem. The final code is as follows:

import pandas as pd
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.preprocessing import Imputer, StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline

class Predict:
    def __init__(self):
        self.X = []
        self.Y = []

    def Read_Clean(self,dataset):
        header_row = ['Age', 'Gender', 'Chest_Pain', 'Resting_Blood_Pressure', 'Serum_Cholestrol',
                      'Fasting_Blood_Sugar', 'Resting_ECG', 'Max_Heart_Rate',
                      'Exercise_Induced_Angina', 'OldPeak',
                      'Slope', 'CA', 'Thal', 'Num']
        df = pd.read_csv(dataset, names=header_row)
        df = df.replace('[?]', np.nan, regex=True)
        df = pd.DataFrame(Imputer(missing_values='NaN', strategy='mean', axis=0)
                          .fit_transform(df), columns=header_row)
        df = df.astype(float)
        return df

    def Split_Dataset(self, df):
        self.Y = df['Num'].apply(lambda x: 1 if x > 0 else 0)
        self.X = df.drop('Num', axis=1)

    def Create_Pipeline(self):
        estimators = []
        estimators.append(('standardize', StandardScaler()))
        estimators.append(('bayes', GaussianNB()))
        model = Pipeline(estimators)
        return model

    def Cross_Validate(self, clf, cv=5):
        scores = cross_val_score(clf, self.X, self.Y, cv=cv, scoring='f1')
        score = scores.mean()
        print("CV scores mean: %.4f " % (score))

    def Fit_Score(self, clf, label='x'):
        clf.fit(self.X, self.Y)
        fit_score = clf.score(self.X, self.Y)
        print("%s: fit score %.5f" % (label, fit_score))

    def ReturnPredictionValue(self, clf, sample):
        y = clf.predict([sample])
        return y[0]

    def PredictionMain(self, sample, dataset_path='dataset/processed.cleveland.data'):
        print "dataset: "+ dataset_path
        data = self.Read_Clean(dataset_path)
        self.Split_Dataset(data)
        self.model = self.Create_Pipeline()
        self.Fit_Score(self.model, label='NB')
        self.Cross_Validate(self.model, 10)
        return self.ReturnPredictionValue(self.model, sample)

Predicting the same sample from the question now returns [0], which is the true value. In fact, by running the following method:

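# Requires: from sklearn.model_selection import cross_val_predict
def CheckTrue(self):
    clf = self.Create_Pipeline()
    out = cross_val_predict(clf, self.X, self.Y)
    p = [out == self.Y]
    c = 0
    # 303 is the number of samples in the dataset
    for i in range(303):
        if p[0][i] == True:
            c += 1
    print "Samples with true values: {}".format(c)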

With the Pipeline code I get 249 correctly predicted samples, whereas before I only got 150.


Answer:

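You did not apply the StandardScaler to the sample. The classifier expects scaled data because it was trained on the output of StandardScaler.transform, but the sample was not scaled the same way as the training data.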
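The effect can be reproduced on a small scale. The sketch below is not the asker's code: it uses a tiny synthetic two-feature dataset (an assumption made purely for illustration, so that it runs without the UCI file) and compares predicting a raw sample against predicting the same sample after passing it through the fitted scaler.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data (not the UCI dataset): two features in large raw units.
offsets = np.array([[a, b] for a in (-5.0, 0.0, 5.0) for b in (-10.0, 0.0, 10.0)])
healthy = offsets + np.array([130.0, 200.0])  # class 0 cluster
sick = offsets + np.array([160.0, 280.0])     # class 1 cluster
X = np.vstack([healthy, sick])
y = np.array([0] * len(healthy) + [1] * len(sick))

# Train the classifier on scaled features, as in the question.
scaler = StandardScaler()
clf = GaussianNB().fit(scaler.fit_transform(X), y)

sample = [130.0, 200.0]  # centre of the healthy cluster, in raw units

# Mistake: the raw sample is far outside the scaled training range.
raw_pred = clf.predict([sample])[0]

# Fix: apply the same fitted scaler to the sample before predicting.
scaled_pred = clf.predict(scaler.transform([sample]))[0]

print("raw sample    ->", raw_pred)
print("scaled sample ->", scaled_pred)
```

In the question's class, the equivalent change would be to keep the fitted scaler (for example as an attribute) and call its transform on the sample inside ReturnPredictionValue.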

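When you combine multiple steps manually (scaling, preprocessing, classification), it is easy to make this kind of mistake. To avoid such problems, it is better to use scikit-learn's Pipeline.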
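As a minimal sketch of that advice, again on a hypothetical toy dataset rather than the UCI file, a Pipeline bundles the scaler and the classifier so that predict applies the fitted scaling to raw samples automatically. The step names match the ones used in the asker's edit.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data (not the UCI dataset), in raw units.
offsets = np.array([[a, b] for a in (-5.0, 0.0, 5.0) for b in (-10.0, 0.0, 10.0)])
healthy = offsets + np.array([130.0, 200.0])  # class 0 cluster
sick = offsets + np.array([160.0, 280.0])     # class 1 cluster
X = np.vstack([healthy, sick])
y = np.array([0] * len(healthy) + [1] * len(sick))

# Scaling and classification are fitted and applied as one unit.
model = Pipeline([
    ("standardize", StandardScaler()),
    ("bayes", GaussianNB()),
])
model.fit(X, y)

# Raw samples go straight into predict; the pipeline scales them first.
healthy_pred = model.predict([[130.0, 200.0]])[0]
sick_pred = model.predict([[160.0, 280.0]])[0]

print("healthy-like sample ->", healthy_pred)
print("sick-like sample    ->", sick_pred)
```

Passing the whole pipeline to cross_val_score or cross_val_predict also refits the scaler on each training fold, so the held-out fold does not influence the scaling statistics.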
