I've run into a problem: three different classifiers, all trained on the same dataset (sklearn's iris dataset), are producing exactly the same accuracy score and confusion matrix. I emailed my professor asking whether this is normal, and if not, what she would suggest, and her response was essentially "this is not normal, go back and check your code."
I've gone over my code quite a bit since then, but I can't see what's wrong. I'm hoping someone here can shed some light on this so I can learn something from the experience.
Here is my code:
```python
# Dataset
from sklearn import datasets
# Data Preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Classifiers
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
# Performance Metrics
from sklearn.metrics import confusion_matrix, accuracy_score

if __name__ == '__main__':
    # Read dataset into memory.
    iris = datasets.load_iris()

    # Extract independent and dependent variables into variables.
    X = iris.data
    y = iris.target

    # Split training and test sets (70/30).
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

    # Fit the scaler to the training set, and transform the feature columns of both the training
    # and test sets -- all of them, since none of the features contain categorical data.
    ss = StandardScaler()
    X_train = ss.fit_transform(X_train)
    X_test = ss.transform(X_test)

    # Create the classifiers.
    dt_classifier = DecisionTreeClassifier(random_state=0)
    svm_classifier = SVC(kernel='rbf', random_state=0)
    lr_classifier = LogisticRegression(random_state=0)

    # Fit the classifiers to the training data.
    dt_classifier.fit(X_train, y_train)
    svm_classifier.fit(X_train, y_train)
    lr_classifier.fit(X_train, y_train)

    # Predict using the now trained classifiers.
    dt_y_pred = dt_classifier.predict(X_test)
    svm_y_pred = svm_classifier.predict(X_test)
    lr_y_pred = lr_classifier.predict(X_test)

    # Create confusion matrices using the predicted results and the actual results from the test set.
    dt_cm = confusion_matrix(y_test, dt_y_pred)
    svm_cm = confusion_matrix(y_test, svm_y_pred)
    lr_cm = confusion_matrix(y_test, lr_y_pred)

    # Calculate accuracy scores using the predicted results and the actual results from the test set.
    dt_score = accuracy_score(y_test, dt_y_pred)
    svm_score = accuracy_score(y_test, svm_y_pred)
    lr_score = accuracy_score(y_test, lr_y_pred)

    # Print confusion matrices and accuracy scores for each classifier.
    print('--- Decision Tree Classifier ---')
    print(f'Confusion Matrix:\n{dt_cm}')
    print(f'Accuracy Score:{dt_score}\n')

    print('--- Support Vector Machine Classifier ---')
    print(f'Confusion Matrix:\n{svm_cm}')
    print(f'Accuracy Score:{svm_score}\n')

    print('--- Logistic Regression Classifier ---')
    print(f'Confusion Matrix:\n{lr_cm}')
    print(f'Accuracy Score:{lr_score}')
```
The output is as follows:
```
--- Decision Tree Classifier ---
Confusion Matrix:
[[16  0  0]
 [ 0 17  1]
 [ 0  0 11]]
Accuracy Score:0.9777777777777777

--- Support Vector Machine Classifier ---
Confusion Matrix:
[[16  0  0]
 [ 0 17  1]
 [ 0  0 11]]
Accuracy Score:0.9777777777777777

--- Logistic Regression Classifier ---
Confusion Matrix:
[[16  0  0]
 [ 0 17  1]
 [ 0  0 11]]
Accuracy Score:0.9777777777777777
```
As you can see, the output is identical for each of the different classifiers. Any help at all would be greatly appreciated.
Answer:
There is nothing wrong with your code.
Similar results are not unexpected when:

- the data is relatively "easy", and
- the sample size is small.
Both of these conditions hold here. The iris dataset is famously easy for modern machine-learning algorithms (including the ones you are using) to classify; combine that with the very small size of your test set (only 45 samples), and results like these are not surprising.
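The 45 comes straight from your split: iris has 150 samples, and 0.30 × 150 = 45 of them end up in the test set. A minimal sketch, reusing your split parameters, to confirm the test-set size and its per-class breakdown (the rows of your confusion matrices each sum to one of these counts):

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X, y = iris.data, iris.target

# The exact same split as in your script: 70/30, random_state=0.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

print(len(y_test))          # 45 test samples out of 150
print(np.bincount(y_test))  # number of test samples per class
```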
In fact, just change the data split to use test_size=0.20 and you will get a perfect accuracy of 1.0 from all 3 models.
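For reference, here is a condensed sketch of your pipeline with only that one parameter changed; the loop is just a compact way of fitting and scoring all three classifiers and does the same thing as your script:

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

iris = datasets.load_iris()
X, y = iris.data, iris.target

# The only change from your script: test_size=0.20 instead of 0.30.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)

# Scale exactly as before: fit on the training set, transform both sets.
ss = StandardScaler()
X_train = ss.fit_transform(X_train)
X_test = ss.transform(X_test)

classifiers = [
    ('Decision Tree', DecisionTreeClassifier(random_state=0)),
    ('SVM (RBF)', SVC(kernel='rbf', random_state=0)),
    ('Logistic Regression', LogisticRegression(random_state=0)),
]

# Fit and score each classifier on the smaller test set.
for name, clf in classifiers:
    clf.fit(X_train, y_train)
    print(name, accuracy_score(y_test, clf.predict(X_test)))
```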
Nothing to worry about.