I ran into a problem while trying to plot ROC curves with imblearn.
Here is a screenshot of my data:
from imblearn.over_sampling import SMOTE, ADASYN
from collections import Counter
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from itertools import cycle
import sys
from sklearn import svm, datasets
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize
from sklearn.multiclass import OneVsRestClassifier
from scipy import interp
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier

# Import some data to play with
df = pd.read_csv("E:\\autodesk\\Hourly and weather ml.csv")

# X and y are different columns of the input data. Input X as numpy array
X = df[['TTI','Max TemperatureF','Mean TemperatureF','Min TemperatureF',' Min Humidity']].values
# # Reshape X. Do this if X has only one value per data point. In this case, TTI.
# # Input y as normal list
y = df['TTI_Category'].as_matrix()

X_resampled, y_resampled = SMOTE().fit_sample(X, y)
y_resampled = label_binarize(y_resampled, classes=['Good','Bad','Ok'])
n_classes = y_resampled.shape[1]

# shuffle and split training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.5, random_state=0)

# Learn to predict each class against the other
classifier = OneVsRestClassifier(DecisionTreeClassifier(random_state=0))
y_score=classifier.fit(X_resampled, y_resampled).predict_proba(X_test)

# Compute ROC curve and ROC area for each class
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(n_classes):
    fpr[i], tpr[i], _ = roc_curve(y_test[:, i], y_score[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])

# Compute micro-average ROC curve and ROC area
fpr["micro"], tpr["micro"], _ = roc_curve(y_test.ravel(), y_score.ravel())
roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])
plt.figure()
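I replaced the original X_train and y_train with X_resampled, y_resampled, because training should be done on the resampled dataset while testing should be done on the original test set. However, I get the following error: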
runfile('E:/autodesk/SMOTE with multiclass.py', wdir='E:/autodesk')
Traceback (most recent call last):
  File "<ipython-input-128-efb16ffc92ca>", line 1, in <module>
    runfile('E:/autodesk/SMOTE with multiclass.py', wdir='E:/autodesk')
  File "C:\Users\Think\Anaconda2\lib\site-packages\spyder\utils\site\sitecustomize.py", line 880, in runfile
    execfile(filename, namespace)
  File "C:\Users\Think\Anaconda2\lib\site-packages\spyder\utils\site\sitecustomize.py", line 87, in execfile
    exec(compile(scripttext, filename, 'exec'), glob, loc)
  File "E:/autodesk/SMOTE with multiclass.py", line 51, in <module>
    fpr[i], tpr[i], _ = roc_curve(y_test[:, i], y_score[:, i])
IndexError: too many indices for array
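I then added a line to binarize both y_resampled and the original y and left everything else unchanged, but I don't know whether I am really training on the resampled data and testing on the original data.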
X_resampled, y_resampled = SMOTE().fit_sample(X, y)
y_resampled = label_binarize(y_resampled, classes=['Good','Bad','Ok'])
y = label_binarize(y, classes=['Good','Bad','Ok'])
n_classes = y.shape[1]
Thank you very much for your help.
Answer:
First, let's look at the error. You did the following:
y_resampled = label_binarize(y_resampled, classes=['Good','Bad','Ok'])
n_classes = y_resampled.shape[1]
So your n_classes is actually 3.
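To make the shape concrete, here is a small sketch of what label_binarize returns for these three classes. The four sample labels are made up for illustration and are not taken from the asker's data.

```python
from sklearn.preprocessing import label_binarize

# Hypothetical labels, only to illustrate the output shape.
labels = ['Good', 'Bad', 'Ok', 'Good']

# One column per entry in `classes`, in the order given there.
binarized = label_binarize(labels, classes=['Good', 'Bad', 'Ok'])

n_classes = binarized.shape[1]
print(binarized.shape)  # (4, 3)
print(binarized[0])     # the row for 'Good' has its 1 in column 0
```

Each column can then be passed to roc_curve as the binary target for one class.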
Later on, you do this:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.5, random_state=0)
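Here you use the original y, not y_resampled. So y_test is a one-dimensional array of shape (n_samples,). (A column vector of shape (n_samples, 1) would fail as well, but with an out-of-bounds message instead.)
In the for loop, i runs over range(n_classes), i.e. 0, 1, 2, and y_test[:, i] indexes y_test along two axes. A one-dimensional array has no second axis, which is why you get "too many indices for array".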
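The difference between the two possible shapes can be checked with NumPy alone. This is a minimal sketch with made-up labels; the variable names are chosen for illustration only.

```python
import numpy as np

# Made-up labels standing in for the un-binarized y_test.
y_1d = np.array(['Good', 'Bad', 'Ok', 'Good'])  # shape (4,)
y_col = y_1d.reshape(-1, 1)                     # shape (4, 1)

try:
    y_1d[:, 0]          # two indices on a one-dimensional array
    msg_1d = None
except IndexError as exc:
    msg_1d = str(exc)

try:
    y_col[:, 1]         # column 1 does not exist, only column 0
    msg_col = None
except IndexError as exc:
    msg_col = str(exc)

print(msg_1d)
print(msg_col)
```

The first message matches the one in the traceback, which indicates that y_test was one-dimensional when roc_curve was called.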
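Second, you should split the data into training and test sets first, and then resample only the training part. If you resample before splitting, the model is trained on the test points themselves (and on synthetic points interpolated from them), so the ROC curves come out too optimistic.
So the following code should do what you want: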
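# First split the data into training and test sets (y holds the raw, un-binarized labels)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.5, random_state=0)

# Then resample only the training data
# (newer imbalanced-learn releases call this method fit_resample)
X_resampled, y_resampled = SMOTE().fit_sample(X_train, y_train)

# Then binarize the labels so they can be used for the multiclass ROC
y_resampled = label_binarize(y_resampled, classes=['Good','Bad','Ok'])
# Do the same for the test labels
y_test = label_binarize(y_test, classes=['Good','Bad','Ok'])
n_classes = y_test.shape[1]

y_score = classifier.fit(X_resampled, y_resampled).predict_proba(X_test)

# Then you can run this loop and the rest of your code as before
for i in range(n_classes):
    fpr[i], tpr[i], _ = roc_curve(y_test[:, i], y_score[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])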
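For anyone who wants to try the split-then-resample order without the original CSV, here is a self-contained sketch on synthetic data. Several things in it are assumptions and not part of the original code: the data is randomly generated, the split is stratified, and `oversample_randomly` is a hypothetical stand-in that merely duplicates minority rows so the example runs with scikit-learn and NumPy only. In real use, `SMOTE().fit_resample(X_train, y_train)` would take its place.

```python
import numpy as np
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import label_binarize
from sklearn.tree import DecisionTreeClassifier

CLASSES = ['Good', 'Bad', 'Ok']


def oversample_randomly(X, y, rng):
    """Hypothetical stand-in for SMOTE: duplicate rows until classes are balanced."""
    labels, counts = np.unique(y, return_counts=True)
    target = counts.max()
    picked = []
    for label, count in zip(labels, counts):
        idx = np.flatnonzero(y == label)
        extra = idx[rng.integers(0, len(idx), size=target - count)]
        picked.append(np.concatenate([idx, extra]))
    picked = np.concatenate(picked)
    return X[picked], y[picked]


# Synthetic, imbalanced three-class data (assumption: not the asker's dataset).
rng = np.random.default_rng(0)
y = np.array(['Good'] * 120 + ['Bad'] * 40 + ['Ok'] * 20)
centers = {'Good': 0.0, 'Bad': 2.0, 'Ok': 4.0}
X = np.array([rng.normal(centers[label], 1.0, size=2) for label in y])

# 1. Split first (stratified here so every class appears in the test set).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0, stratify=y)

# 2. Resample the training part only; the test part stays untouched.
X_resampled, y_resampled = oversample_randomly(X_train, y_train, rng)

# 3. Binarize both label sets with the same class order.
y_resampled_bin = label_binarize(y_resampled, classes=CLASSES)
y_test_bin = label_binarize(y_test, classes=CLASSES)
n_classes = y_test_bin.shape[1]

# 4. Train on resampled data, score the original test data.
classifier = OneVsRestClassifier(DecisionTreeClassifier(random_state=0))
y_score = classifier.fit(X_resampled, y_resampled_bin).predict_proba(X_test)

# 5. One ROC curve per class.
fpr, tpr, roc_auc = {}, {}, {}
for i in range(n_classes):
    fpr[i], tpr[i], _ = roc_curve(y_test_bin[:, i], y_score[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])

print({CLASSES[i]: round(roc_auc[i], 3) for i in range(n_classes)})
```

Keeping the binarized labels in separate variables (`y_test_bin`, `y_resampled_bin`) leaves the raw labels available, which makes it harder to pass the wrong array to roc_curve.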