I have implemented logistic regression on bank loan data. I used GridSearchCV for hyperparameter tuning and ran it with several fold counts, kfolds = [3, 5, 6]. Here is my code:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
#from google.colab import files
import io
import warnings
warnings.filterwarnings('ignore')
#uploaded = files.upload()
df = pd.read_csv('CleanedLoanData13Cols.csv')

from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV

X = df.drop('loan_status', axis=1, inplace=False)
y = df['loan_status']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 4)

parameters = {'penalty': ['l1', 'l2','elasticnet'],
              'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000],
              'solver' : ['liblinear', 'newton-cg', 'lbfgs', 'saga', 'sag'],
              'multi_class' : ['auto'],
              'max_iter' : [5,15,25]
             }

import warnings
warnings.filterwarnings("ignore")

cv_folds = [3, 5, 6]
s_scaler = StandardScaler()
#m_scaler = MinMaxScaler()
#r_scaler = RobustScaler()
s_scaled_X_train = s_scaler.fit_transform(X_train)
s_scaled_X_test = s_scaler.transform(X_test)

for x in cv_folds:
    logmodel = GridSearchCV(LogisticRegression(random_state = 42), parameters, cv = x, scoring = 'accuracy', refit = True)
    logmodel.fit(X_train, y_train)
    print('The best score with CV =', x, 'is', logmodel.score(X_test, y_test), 'with parameters =\n\n', logmodel.best_params_, '\n\n')
Output (first question: this looks wrong to me, please correct me if I am mistaken):
The best score with CV = 3 is 0.929636746271388 with parameters =
{'C': 0.001, 'max_iter': 25, 'multi_class': 'auto', 'penalty': 'l2', 'solver': 'liblinear'}

The best score with CV = 5 is 0.929636746271388 with parameters =
{'C': 0.001, 'max_iter': 25, 'multi_class': 'auto', 'penalty': 'l2', 'solver': 'liblinear'}

The best score with CV = 6 is 0.929636746271388 with parameters =
{'C': 0.001, 'max_iter': 25, 'multi_class': 'auto', 'penalty': 'l2', 'solver': 'liblinear'}
Continuing:
results = logmodel.cv_results_
print(results.get('params'))
print(results.get('mean_test_score'))
Output:
[0.9084348 nan nan 0.8323203 nan 0.83239873 0.83671225 0.8323203 0.8323203 0.8323203 nan nan nan nan nan 0.91647373 nan nan 0.8323203 nan 0.902435 0.89474906 0.8520445 0.8323203 and so on
Continuing:
print(results.get('mean_train_score'))
Output: None
print(logmodel.best_params_)
{'C': 0.001, 'max_iter': 25, 'multi_class': 'auto', 'penalty': 'l2', 'solver': 'liblinear'}
print(logmodel.best_score_)
Output: 0.9226303384209481 (I think something is wrong here too, because this does not match the accuracy in the classification report.)
final_model = logmodel.best_estimator_
s_predictions = final_model.predict(s_scaled_X_test)

from sklearn.metrics import classification_report, confusion_matrix, plot_confusion_matrix
print(classification_report(y_test, s_predictions))
print(confusion_matrix(y_test, s_predictions))
Output (the accuracy here is 0.62, whereas above it was 0.92):
              precision    recall  f1-score   support

           0       0.88      0.64      0.74      9197
           1       0.22      0.53      0.31      1732

    accuracy                           0.62     10929
   macro avg       0.55      0.59      0.53     10929
weighted avg       0.77      0.62      0.67     10929

[[5902 3295]
 [ 812  920]]
I don't know what went wrong. I have been racking my brain over this for hours and cannot see my mistake. I would be very grateful for any input.
Answer:
The problem is that you fit the model on the unscaled data X_train, y_train:
logmodel.fit(X_train, y_train)
You then predict on the scaled data s_scaled_X_test, which explains the drop in performance:
s_predictions = final_model.predict(s_scaled_X_test)
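The effect can be reproduced on toy data. The following is a minimal sketch, not the asker's setup: the synthetic features, the offsets, and the names acc_consistent and acc_mismatched are all assumptions made for illustration. It fits a model on unscaled features and then scores it once on unscaled and once on standardized test features.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data standing in for the loan table (an assumption).
rng = np.random.RandomState(0)
z = rng.normal(size=(2000, 5))
y = (z.sum(axis=1) + 0.5 * rng.normal(size=2000) > 0).astype(int)
X = 5.0 * z + 20.0  # shift and stretch so the features are clearly unstandardized

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=4
)

# Scaler fitted on the training split only, as in the question.
scaler = StandardScaler().fit(X_train)
X_test_scaled = scaler.transform(X_test)

# Model fitted on the UNSCALED training data.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Same preprocessing at fit and predict time.
acc_consistent = model.score(X_test, y_test)
# Different preprocessing at predict time, as in the question.
acc_mismatched = model.score(X_test_scaled, y_test)

print("unscaled train, unscaled test:", acc_consistent)
print("unscaled train, scaled test:  ", acc_mismatched)
```

The coefficients were learned for one feature scale and are applied to another, so the second score should be far lower than the first. This also suggests why the loop in the question printed 0.93: logmodel.score(X_test, y_test) used unscaled test data, which matched how the model was trained.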
To fix this, train the model on the scaled data, like so:
logmodel.fit(s_scaled_X_train, y_train)
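One way to make this kind of mismatch harder to reintroduce is to put the scaler and the classifier in a single Pipeline and hand that to GridSearchCV. The sketch below is one possible arrangement under stated assumptions, not the only correct one: the data is synthetic (make_classification stands in for the loan CSV), and the step names "scaler" and "clf", the reduced grid, and the variable names are illustrative choices. It also touches two other symptoms from the question. The nan entries in mean_test_score typically come from solver/penalty pairs that LogisticRegression rejects (for example l1 with lbfgs), so the grid is written as a list of dicts containing only valid pairs. And mean_train_score is absent unless return_train_score=True is passed, which is why results.get('mean_train_score') returned None.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the loan data (an assumption, not the asker's CSV).
X, y = make_classification(n_samples=1000, n_features=12, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=4
)

# The scaler is refitted inside every CV fold and reapplied at predict time,
# so training and test data always go through the same transformation.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000, random_state=42)),
])

# Only solver/penalty combinations that LogisticRegression accepts.
c_values = [0.01, 0.1, 1, 10]
param_grid = [
    {"clf__solver": ["lbfgs"], "clf__penalty": ["l2"], "clf__C": c_values},
    {"clf__solver": ["liblinear"], "clf__penalty": ["l1", "l2"], "clf__C": c_values},
]

search = GridSearchCV(
    pipe, param_grid, cv=5, scoring="accuracy", return_train_score=True
)
search.fit(X_train, y_train)  # raw features in; scaling happens in the pipeline

cv_accuracy = search.best_score_               # mean accuracy over the CV folds
test_accuracy = search.score(X_test, y_test)   # accuracy on the held-out split
n_failed = int(np.isnan(search.cv_results_["mean_test_score"]).sum())

print("best params:", search.best_params_)
print("CV accuracy:", cv_accuracy, "test accuracy:", test_accuracy)
print("failed fits:", n_failed)
```

Note that best_score_ is the mean cross-validated accuracy on the training split, while score(X_test, y_test) is the accuracy of the refitted best model on the test split, so the two numbers are expected to be close but not identical. In the question, the same best parameters were selected for every cv value, and refit=True retrains on the same full training set, which would explain why the printed test score was identical across the three runs.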