I would like to know whether this is a legitimate way of computing classification accuracy:
- obtain the precision/recall thresholds
- for each threshold, binarize the continuous y_scores
- compute their accuracy from the contingency table (confusion matrix)
- return the average accuracy over the thresholds
import numpy as np
from sklearn.metrics import confusion_matrix, precision_recall_curve
from sklearn.preprocessing import binarize

# note: precision_recall_curve returns (precision, recall, thresholds), in that order
precision, recall, thresholds = precision_recall_curve(np.array(np_y_true), np.array(np_y_scores))

accuracy = 0
for threshold in thresholds:
    # binarize expects a 2D array, hence the reshape and the [0]
    y_pred = binarize(np.array(np_y_scores).reshape(1, -1), threshold=threshold)[0]
    contingency_table = confusion_matrix(np_y_true, y_pred)
    # accuracy = (TN + TP) / total
    accuracy += (contingency_table[0][0] + contingency_table[1][1]) / np.sum(contingency_table)

print("Classification accuracy is: {}".format(accuracy / len(thresholds)))
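For what it's worth, the per-threshold accuracy computed by hand from the confusion matrix is exactly what `sklearn.metrics.accuracy_score` returns; a minimal sketch of the same procedure, with made-up toy arrays standing in for `np_y_true` / `np_y_scores`:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_curve

# toy data, assumed for illustration only
y_true = np.array([1, 1, 0, 0, 1, 0])
y_scores = np.array([0.9, 0.8, 0.7, 0.3, 0.6, 0.2])

precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
# binarize the scores at each threshold and score the resulting predictions
accuracies = [accuracy_score(y_true, (y_scores >= t).astype(int)) for t in thresholds]
print("Mean accuracy over thresholds: {}".format(np.mean(accuracies)))
```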
Answer:
You are heading in the right direction. The confusion matrix is definitely the right starting point for computing a classifier's accuracy. It seems to me that what you are actually aiming for is the receiver operating characteristic (ROC) curve.
In statistics, a receiver operating characteristic (ROC) curve is a graphical plot that illustrates the performance of a binary classifier system as its discrimination threshold is varied. https://en.wikipedia.org/wiki/Receiver_operating_characteristic
The AUC (area under the curve) is a measure of a classifier's performance. More information and explanations can be found here:
https://stats.stackexchange.com/questions/132777/what-does-auc-stand-for-and-what-is-it
http://mlwiki.org/index.php/ROC_Analysis
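As a sanity check, scikit-learn already ships an AUC computation, so any hand-rolled implementation can be validated against it; a minimal sketch with toy data (the arrays are made up, using the same {-1, 1} label convention):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# toy data: three positives (label 1) and three negatives (label -1)
y_true = np.array([1, 1, -1, -1, 1, -1])
y_scores = np.array([0.9, 0.8, 0.7, 0.3, 0.6, 0.2])

# 8 of the 9 positive/negative pairs are ranked correctly, so AUC = 8/9
print(roc_auc_score(y_true, y_scores))
```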
Here is my implementation, which you are welcome to improve or comment on:
import numpy as np
import matplotlib.pyplot as plt

def auc(y_true, y_val, plot=False):
    # check input
    if len(y_true) != len(y_val):
        raise ValueError('Label vector (y_true) and corresponding value vector (y_val) must have the same length.\n')
    # empty lists for true positive and false positive counts
    tp = []
    fp = []
    # count 1's and -1's in y_true
    cond_positive = list(y_true).count(1)
    cond_negative = list(y_true).count(-1)
    # all possibly relevant bias parameters stored in a list
    bias_set = sorted(list(set(y_val)), key=float, reverse=True)
    bias_set.append(min(bias_set) * 0.9)
    # initialize y_pred array full of negative predictions (-1)
    y_pred = np.ones(len(y_true)) * (-1)
    # the computation time is mainly influenced by this for loop
    # for a contamination rate of 1% it already takes ~8s to terminate
    for bias in bias_set:
        # "lower values tend to correspond to label -1"
        # indices of values which exceed the bias
        posIdx = np.where(y_val > bias)
        # set predicted values to 1
        y_pred[posIdx] = 1
        # the following computation yields values which enable a distinction
        # between the cases of true positive (3) and false positive (1)
        results = np.asarray(y_true) + 2 * np.asarray(y_pred)
        # append the number of tp's and fp's
        tp.append(float(list(results).count(3)))
        fp.append(float(list(results).count(1)))
    # calculate true positive rate and false positive rate
    tpr = np.asarray(tp) / cond_positive
    fpr = np.asarray(fp) / cond_negative
    # optional scatterplot of the ROC curve
    if plot:
        plt.scatter(fpr, tpr)
        plt.show()
    # calculate AUC by trapezoidal integration
    AUC = np.trapz(tpr, fpr)
    return AUC
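If the per-bias loop becomes the bottleneck, the same curve can be obtained from scikit-learn's `roc_curve`, which sweeps every distinct score as a threshold in vectorized code; a sketch under that assumption (`fast_auc` and the toy arrays are my own names/data, not from the original):

```python
import numpy as np
from sklearn.metrics import roc_curve

def fast_auc(y_true, y_val):
    # roc_curve returns the fpr/tpr pair for every distinct threshold at once
    fpr, tpr, _ = roc_curve(y_true, y_val)
    # trapezoidal integration of the ROC curve, as in the loop-based version
    return np.trapz(tpr, fpr)

# toy data with the same {-1, 1} label convention
y_true = [1, 1, -1, -1, 1, -1]
y_val = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2]
print(fast_auc(y_true, y_val))
```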