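I have 10+ features and tens of thousands of cases for training a logistic regression model that classifies people's ethnicity. The first example is French vs. non-French, and the second is English vs. non-English. The results are as follows: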
    //////////////////////////////////////////////////////
    1= fr
    0= non-fr
    Class count:
    0    69109
    1    30891
    dtype: int64
    Accuracy: 0.95126
    Classification report:
                 precision    recall  f1-score   support

              0       0.97      0.96      0.96     34547
              1       0.92      0.93      0.92     15453

    avg / total       0.95      0.95      0.95     50000

    Confusion matrix:
    [[33229  1318]
     [ 1119 14334]]
    AUC= 0.944717975754
    //////////////////////////////////////////////////////
    1= en
    0= non-en
    Class count:
    0    76125
    1    23875
    dtype: int64
    Accuracy: 0.7675
    Classification report:
                 precision    recall  f1-score   support

              0       0.91      0.78      0.84     38245
              1       0.50      0.74      0.60     11755

    avg / total       0.81      0.77      0.78     50000

    Confusion matrix:
    [[29677  8568]
     [ 3057  8698]]
    AUC= 0.757955582999
    //////////////////////////////////////////////////////
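However, I am getting some very strange-looking AUC curves: they are triangular rather than the usual jagged, rounded curves. Why do they have this shape? Is there a possible mistake?

Code: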
    all_dict = []
    for i in range(0, len(my_dict)):
        temp_dict = dict(my_dict[i].items() + my_dict2[i].items() + my_dict3[i].items() + my_dict4[i].items()
            + my_dict5[i].items() + my_dict6[i].items() + my_dict7[i].items() + my_dict8[i].items()
            + my_dict9[i].items() + my_dict10[i].items() + my_dict11[i].items() + my_dict12[i].items()
            + my_dict13[i].items() + my_dict14[i].items() + my_dict15[i].items() + my_dict16[i].items()
            )
        all_dict.append(temp_dict)

    newX = dv.fit_transform(all_dict)

    # Separate the training and testing data sets
    half_cut = int(len(df)/2.0)*-1
    X_train = newX[:half_cut]
    X_test = newX[half_cut:]
    y_train = y[:half_cut]
    y_test = y[half_cut:]

    # Fitting X and y into model, using training data
    #$$
    lr.fit(X_train, y_train)

    # Making predictions using trained data
    #$$
    y_train_predictions = lr.predict(X_train)
    #$$
    y_test_predictions = lr.predict(X_test)

    #print (y_train_predictions == y_train).sum().astype(float)/(y_train.shape[0])
    print 'Accuracy:',(y_test_predictions == y_test).sum().astype(float)/(y_test.shape[0])
    print 'Classification report:'
    print classification_report(y_test, y_test_predictions)
    #print sk_confusion_matrix(y_train, y_train_predictions)
    print 'Confusion matrix:'
    print sk_confusion_matrix(y_test, y_test_predictions)

    #print y_test[1:20]
    #print y_test_predictions[1:20]

    #print y_test[1:10]

    #print np.bincount(y_test)
    #print np.bincount(y_test_predictions)

    # Find and plot AUC
    false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_test_predictions)
    roc_auc = auc(false_positive_rate, true_positive_rate)
    print 'AUC=',roc_auc

    plt.title('Receiver Operating Characteristic')
    plt.plot(false_positive_rate, true_positive_rate, 'b', label='AUC = %0.2f'% roc_auc)
    plt.legend(loc='lower right')
    plt.plot([0,1],[0,1],'r--')
    plt.xlim([-0.1,1.2])
    plt.ylim([-0.1,1.2])
    plt.ylabel('True Positive Rate')
    plt.xlabel('False Positive Rate')
    plt.show()
Answer:

You are doing it wrong. According to the documentation:
    y_score : array, shape = [n_samples]
        Target scores, can either be probability estimates of the positive class or confidence values.
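So in this line:

    roc_curve(y_test, y_test_predictions)

you should pass the result of decision_function (or one of the two columns of the predict_proba result) to the roc_curve function, rather than the actual predictions. Hard 0/1 predictions allow only one meaningful threshold, so the curve has a single point between (0, 0) and (1, 1), which is why it looks like a triangle.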
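
To illustrate why hard labels produce a triangle, here is a minimal pure-Python sketch. It is not scikit-learn's implementation: the helpers roc_points, trapezoid_auc and auc_from_confusion are hypothetical names introduced only for this example, and the toy labels and scores are made up. It sweeps a threshold over the supplied values, then checks that a three-point curve built from the confusion matrices in the question gives approximately the AUC values reported there.

```python
def roc_points(y_true, y_score):
    """Return (FPR, TPR) points from sweeping a threshold over y_score."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]
    # One threshold per distinct value, from highest to lowest.
    for thr in sorted(set(y_score), reverse=True):
        tp = sum(1 for t, s in zip(y_true, y_score) if s >= thr and t == 1)
        fp = sum(1 for t, s in zip(y_true, y_score) if s >= thr and t == 0)
        points.append((fp / float(neg), tp / float(pos)))
    return points


def trapezoid_auc(points):
    """Area under a piecewise-linear curve given as (x, y) points sorted by x."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def auc_from_confusion(tn, fp, fn, tp):
    """AUC of the three-point curve implied by a single confusion matrix."""
    tpr = tp / float(tp + fn)
    fpr = fp / float(fp + tn)
    return trapezoid_auc([(0.0, 0.0), (fpr, tpr), (1.0, 1.0)])


# Toy data, made up for illustration only.
y_true = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.05, 0.1, 0.2, 0.35, 0.4, 0.55, 0.6, 0.7, 0.8, 0.9]
hard = [1 if s >= 0.5 else 0 for s in scores]  # what predict() would return

hard_curve = roc_points(y_true, hard)     # only two distinct values: a triangle
score_curve = roc_points(y_true, scores)  # one point per distinct score

print(len(hard_curve), len(score_curve))

# Confusion matrices from the question, laid out as [[TN, FP], [FN, TP]].
print(auc_from_confusion(33229, 1318, 1119, 14334))  # French vs. non-French
print(auc_from_confusion(29677, 8568, 3057, 8698))   # English vs. non-English
```

The two printed AUC values come out at roughly 0.9447 and 0.7580, in line with the figures in the question, which suggests those numbers describe the single operating point of predict() rather than the full ranking quality of the model. In the question's code, the corresponding change would be to pass lr.decision_function(X_test) or lr.predict_proba(X_test)[:, 1] as the second argument to roc_curve.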