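I am trying to build a heart disease prediction program using Naive Bayes. Once the classifier was finished, cross-validation showed a mean accuracy of 80%. However, when I try to predict a given sample, the prediction is completely wrong! The dataset is the heart disease dataset from the UCI repository, which contains 303 samples. There are two classes: 0 for healthy and 1 for diseased. When I predict a sample taken from the dataset itself, the result fails to match its true value for all but a very few samples. Here is the code: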
    import pandas as pd
    import numpy as np
    from sklearn.naive_bayes import GaussianNB
    from sklearn.model_selection import cross_val_score, train_test_split
    from sklearn.preprocessing import Imputer, StandardScaler

    class Predict:
        def Read_Clean(self,dataset):
            header_row = ['Age', 'Gender', 'Chest_Pain', 'Resting_Blood_Pressure', 'Serum_Cholestrol',
                          'Fasting_Blood_Sugar', 'Resting_ECG', 'Max_Heart_Rate',
                          'Exercise_Induced_Angina', 'OldPeak', 'Slope', 'CA', 'Thal', 'Num']
            df = pd.read_csv(dataset, names=header_row)
            df = df.replace('[?]', np.nan, regex=True)
            df = pd.DataFrame(Imputer(missing_values='NaN', strategy='mean', axis=0)
                              .fit_transform(df), columns=header_row)
            df = df.astype(float)
            return df

        def Train_Test_Split_data(self,dataset):
            Y = dataset['Num'].apply(lambda x: 1 if x > 0 else 0)
            X = dataset.drop('Num', axis=1)
            validation_size = 0.20
            seed = 42
            X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=validation_size, random_state=seed)
            return X_train, X_test, Y_train, Y_test

        def Scaler(self, X_train, X_test):
            scaler = StandardScaler()
            X_train = scaler.fit_transform(X_train)
            X_test = scaler.transform(X_test)
            return X_train, X_test

        def Cross_Validate(self, clf, X_train, Y_train, cv=5):
            scores = cross_val_score(clf, X_train, Y_train, cv=cv, scoring='f1')
            score = scores.mean()
            print("CV scores mean: %.4f " % (score))
            return score, scores

        def Fit_Score(self, clf, X_train, Y_train, X_test, Y_test, label='x'):
            clf.fit(X_train, Y_train)
            fit_score = clf.score(X_train, Y_train)
            pred_score = clf.score(X_test, Y_test)
            print("%s: fit score %.5f, predict score %.5f" % (label, fit_score, pred_score))
            return pred_score

        def ReturnPredictionValue(self, clf, sample):
            y = clf.predict([sample])
            return y[0]

        def PredictionMain(self, sample, dataset_path='dataset/processed.cleveland.data'):
            data = self.Read_Clean(dataset_path)
            X_train, X_test, Y_train, Y_test = self.Train_Test_Split_data(data)
            X_train, X_test = self.Scaler(X_train, X_test)
            self.NB = GaussianNB()
            self.Fit_Score(self.NB, X_train, Y_train, X_test, Y_test, label='NB')
            self.Cross_Validate(self.NB, X_train, Y_train, 10)
            return self.ReturnPredictionValue(self.NB, sample)
When I run the following code:
    if __name__ == '__main__':
        sample = [41.0, 0.0, 2.0, 130.0, 204.0, 0.0, 2.0, 172.0, 0.0, 1.4, 1.0, 0.0, 3.0]
        p = Predict()
        print "Prediction value: {}".format(p.PredictionMain(sample))
The result is:
    NB: fit score 0.84711, predict score 0.83607
    CV scores mean: 0.8000
    Prediction value: 1
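I get 1 instead of 0 (this sample is one of the samples already in the dataset). I did the same for several samples from the dataset and got the wrong result most of the time, as if the accuracy were not 80% at all!
Any help would be greatly appreciated. Thanks in advance.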
Edit: Using a Pipeline solved the problem. The final code is below:
    import pandas as pd
    import numpy as np
    from sklearn.naive_bayes import GaussianNB
    from sklearn.model_selection import cross_val_score, train_test_split
    from sklearn.preprocessing import Imputer, StandardScaler, OneHotEncoder
    from sklearn.pipeline import Pipeline

    class Predict:
        def __init__(self):
            self.X = []
            self.Y = []

        def Read_Clean(self,dataset):
            header_row = ['Age', 'Gender', 'Chest_Pain', 'Resting_Blood_Pressure', 'Serum_Cholestrol',
                          'Fasting_Blood_Sugar', 'Resting_ECG', 'Max_Heart_Rate',
                          'Exercise_Induced_Angina', 'OldPeak', 'Slope', 'CA', 'Thal', 'Num']
            df = pd.read_csv(dataset, names=header_row)
            df = df.replace('[?]', np.nan, regex=True)
            df = pd.DataFrame(Imputer(missing_values='NaN', strategy='mean', axis=0)
                              .fit_transform(df), columns=header_row)
            df = df.astype(float)
            return df

        def Split_Dataset(self, df):
            self.Y = df['Num'].apply(lambda x: 1 if x > 0 else 0)
            self.X = df.drop('Num', axis=1)

        def Create_Pipeline(self):
            estimators = []
            estimators.append(('standardize', StandardScaler()))
            estimators.append(('bayes', GaussianNB()))
            model = Pipeline(estimators)
            return model

        def Cross_Validate(self, clf, cv=5):
            scores = cross_val_score(clf, self.X, self.Y, cv=cv, scoring='f1')
            score = scores.mean()
            print("CV scores mean: %.4f " % (score))

        def Fit_Score(self, clf, label='x'):
            clf.fit(self.X, self.Y)
            fit_score = clf.score(self.X, self.Y)
            print("%s: fit score %.5f" % (label, fit_score))

        def ReturnPredictionValue(self, clf, sample):
            y = clf.predict([sample])
            return y[0]

        def PredictionMain(self, sample, dataset_path='dataset/processed.cleveland.data'):
            print "dataset: "+ dataset_path
            data = self.Read_Clean(dataset_path)
            self.Split_Dataset(data)
            self.model = self.Create_Pipeline()
            self.Fit_Score(self.model, label='NB')
            self.Cross_Validate(self.model, 10)
            return self.ReturnPredictionValue(self.model, sample)
Now, predicting the same sample from the question returns [0], which is the true value. In fact, by running the following method:
    # Note: needs `from sklearn.model_selection import cross_val_predict`
    def CheckTrue(self):
        clf = self.Create_Pipeline()
        out = cross_val_predict(clf, self.X, self.Y)
        p = [out == self.Y]
        c = 0
        for i in range(303):
            if p[0][i] == True:
                c += 1
        print "Samples with true values: {}".format(c)
I get 249 correctly predicted samples with the Pipeline code, versus only 150 before.
Answer:
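You are not applying the StandardScaler to the sample. The classifier expects scaled data because it was trained on the output of StandardScaler.transform, but the sample is not scaled the same way as the training data.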
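As a minimal sketch of that mismatch, the snippet below trains GaussianNB on scaled data and then predicts the same sample twice: once raw and once passed through the fitted scaler. The two-feature synthetic data and the names `X`, `y`, `sample`, `raw_pred` and `scaled_pred` are assumptions made for illustration, not the Cleveland dataset or the code from the question.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): two well-separated classes in raw units.
rng = np.random.RandomState(0)
X = np.vstack([
    rng.normal(loc=[50.0, 120.0], scale=[5.0, 10.0], size=(100, 2)),  # class 0
    rng.normal(loc=[65.0, 160.0], scale=[5.0, 10.0], size=(100, 2)),  # class 1
])
y = np.array([0] * 100 + [1] * 100)

# Train on standardized features, as in the question.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
clf = GaussianNB().fit(X_scaled, y)

sample = [50.0, 120.0]  # a typical class-0 point, in raw units

# Mistake from the question: the raw sample lies far outside the scaled
# training range, so the predicted label is not meaningful.
raw_pred = clf.predict([sample])[0]

# Fix: reuse the scaler that was fitted on the training data.
scaled_pred = clf.predict(scaler.transform([sample]))[0]

print("raw sample    ->", raw_pred)
print("scaled sample ->", scaled_pred)
```

In the question's class, the equivalent change would be to keep the fitted scaler (for example as an attribute) and call its `transform` on the sample before `clf.predict`.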
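It is easy to make this kind of mistake when combining multiple steps (scaling, preprocessing, classification) by hand. To avoid such problems, it is best to use scikit-learn's Pipeline.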
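A rough sketch of the Pipeline approach follows, again on synthetic two-feature data that is assumed purely for illustration. It uses `make_pipeline`, a scikit-learn shorthand for building a `Pipeline` with auto-generated step names; the variable names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption), in raw, unscaled units.
rng = np.random.RandomState(0)
X = np.vstack([
    rng.normal(loc=[50.0, 120.0], scale=[5.0, 10.0], size=(100, 2)),  # class 0
    rng.normal(loc=[65.0, 160.0], scale=[5.0, 10.0], size=(100, 2)),  # class 1
])
y = np.array([0] * 100 + [1] * 100)

# Scaling and classification are bundled into one estimator.
model = make_pipeline(StandardScaler(), GaussianNB())

# During cross-validation the scaler is refit on each training fold only.
scores = cross_val_score(model, X, y, cv=5)

# fit() scales and trains; predict() scales the raw sample automatically.
model.fit(X, y)
pipeline_pred = model.predict([[50.0, 120.0]])[0]

print("CV accuracy mean: %.4f" % scores.mean())
print("prediction for raw sample:", pipeline_pred)
```

Because the pipeline owns the scaler, callers only ever pass raw feature values, so the training and prediction paths cannot drift apart.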