k-nearest neighbors with cross-validation for accuracy score and confusion matrix

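I have a dataset in which each column is one sample: the rows of numbers are the inputs and the letter at the top is the output.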

    A,A,A,B,B,B
    -0.979090189,0.338819904,-0.253746508,0.213454999,-0.580601104,-0.441683968
    -0.48395313,0.436456904,-1.427424032,-0.107093825,0.320813402,0.060866105
    -1.098818173,-0.999161692,-1.371721698,-1.057324962,-1.161752652,-0.854872591
    -1.53191442,-1.465454248,-1.350414216,-1.732518018,-1.674040715,-1.561568496
    2.522796162,2.498153298,3.11756171,2.125738509,3.003929536,2.514411247
    -0.060161596,-0.487513844,-1.083513761,-0.908023322,-1.047536921,-0.48276759
    0.241962669,0.181365373,0.174042637,-0.048013217,-0.177434916,0.42738621
    -0.603856395,-1.020531402,-1.091134021,-0.863008165,-0.683233589,-0.849059931
    -0.626159165,-0.348144322,-0.518640038,-0.394482485,-0.249935646,-0.543947259
    -1.407263942,-1.387660115,-1.612988118,-1.141282747,-0.944745366,-1.030944216
    -0.682567673,-0.043613473,-0.105679403,0.135431139,0.059104888,-0.132060832
    -1.10107164,-1.030047313,-1.239075022,-0.651818656,-1.043589073,-0.765992541

I tried using KNN with leave-one-out cross-validation (LOOCV) to get an accuracy score and a confusion matrix.

    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.metrics import confusion_matrix
    from sklearn.model_selection import LeaveOneOut
    import pandas as pd

    def main():
      csv = 'data.csv'
      df = pd.read_csv(csv)

      # Each column is one sample, so transpose; the header row holds the labels.
      X = df.values.T
      y = df.columns.values

      clf = KNeighborsClassifier()
      loo = LeaveOneOut()

      # Fit on all samples but one, then score on the single held-out sample.
      for train_index, test_index in loo.split(X):
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]

        clf.fit(X_train, y_train)
        y_true = y_test
        y_pred = clf.predict(X_test)
        ac = accuracy_score(y_true, y_pred)
        cm = confusion_matrix(y_true, y_pred)
        print(ac)
        print(cm)

    if __name__ == '__main__':
      main()

However, my results are all 0. What am I doing wrong?


Answer:

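I think your model is not being trained properly: each fold asks it to guess a single value, and it never gets that one right. I suggest switching to KFold or StratifiedKFold. Leave-one-out (LOO) is also very time-consuming on large samples. Below is what I get with StratifiedKFold and 3 splits on your X data. I filled y randomly with 0s and 1s instead of using A and B, and I did not transpose the data, so it has 12 rows: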

    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.metrics import confusion_matrix
    from sklearn.model_selection import StratifiedKFold
    import pandas as pd

    # Raw string so the backslash in the Windows path is not read as an escape.
    csv = r'C:\df_low_X.csv'
    df = pd.read_csv(csv, header=None)
    print(df)

    # The last column holds the (randomly filled) 0/1 labels; the rest are features.
    X = df.iloc[:, :-1].values
    y = df.iloc[:, -1].values

    clf = KNeighborsClassifier()
    kf = StratifiedKFold(n_splits=3)

    ac = []
    cm = []
    for train_index, test_index in kf.split(X, y):
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
        print(X_train, X_test)

        clf.fit(X_train, y_train)
        y_pred = clf.predict(X_test)
        # Collect one accuracy score and one confusion matrix per fold.
        ac.append(accuracy_score(y_test, y_pred))
        cm.append(confusion_matrix(y_test, y_pred))
    print(ac)
    print(cm)

    # ac
    # [0.25, 0.75, 0.5]
    # cm
    # [array([[1, 1],
    #        [2, 0]], dtype=int64), array([[1, 1],
    #        [0, 2]], dtype=int64), array([[0, 2],
    #        [0, 2]], dtype=int64)]
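If you still want leave-one-out on the original six samples, here is a minimal sketch of one way to do it. It is not part of the answer above, and it rests on a few assumptions. Two details of the question's setup look like they force the zeros. First, `pd.read_csv` renames repeated header names, so taking the labels from `df.columns` gives six distinct labels rather than A and B. Second, the default `n_neighbors=5` uses all five remaining samples in each fold, so the held-out sample's class always loses the vote 3 to 2. The sketch reads the label row separately, pools the held-out predictions with `cross_val_predict`, and scores them once. The helper `loo_scores` and the choice of `n_neighbors=3` are illustrative assumptions only.

```python
import io

import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier

# The question's data, inlined so the sketch is self-contained.
RAW = """\
A,A,A,B,B,B
-0.979090189,0.338819904,-0.253746508,0.213454999,-0.580601104,-0.441683968
-0.48395313,0.436456904,-1.427424032,-0.107093825,0.320813402,0.060866105
-1.098818173,-0.999161692,-1.371721698,-1.057324962,-1.161752652,-0.854872591
-1.53191442,-1.465454248,-1.350414216,-1.732518018,-1.674040715,-1.561568496
2.522796162,2.498153298,3.11756171,2.125738509,3.003929536,2.514411247
-0.060161596,-0.487513844,-1.083513761,-0.908023322,-1.047536921,-0.48276759
0.241962669,0.181365373,0.174042637,-0.048013217,-0.177434916,0.42738621
-0.603856395,-1.020531402,-1.091134021,-0.863008165,-0.683233589,-0.849059931
-0.626159165,-0.348144322,-0.518640038,-0.394482485,-0.249935646,-0.543947259
-1.407263942,-1.387660115,-1.612988118,-1.141282747,-0.944745366,-1.030944216
-0.682567673,-0.043613473,-0.105679403,0.135431139,0.059104888,-0.132060832
-1.10107164,-1.030047313,-1.239075022,-0.651818656,-1.043589073,-0.765992541
"""

# Default parsing renames repeated headers, so every "label" becomes unique.
mangled = list(pd.read_csv(io.StringIO(RAW)).columns)
print(mangled)  # ['A', 'A.1', 'A.2', 'B', 'B.1', 'B.2']

# Read the label row separately so the labels stay as plain A/B.
y = np.array(RAW.splitlines()[0].split(","))
values = pd.read_csv(io.StringIO(RAW), header=None, skiprows=1)
X = values.to_numpy().T  # 6 samples (columns) x 12 features (rows)


def loo_scores(n_neighbors):
    """Pool the leave-one-out predictions, then score them once (hypothetical helper)."""
    clf = KNeighborsClassifier(n_neighbors=n_neighbors)
    y_pred = cross_val_predict(clf, X, y, cv=LeaveOneOut())
    acc = accuracy_score(y, y_pred)
    cm = confusion_matrix(y, y_pred, labels=["A", "B"])
    return acc, cm


# k=5 (the default) votes over all five remaining samples, so the held-out
# sample's class is always outvoted 3 to 2 and every prediction is wrong.
acc5, cm5 = loo_scores(5)
print(acc5)  # 0.0
print(cm5)

# k=3 is an arbitrary smaller choice, used here only for illustration.
acc3, cm3 = loo_scores(3)
print(acc3)
print(cm3)
```

Scoring the pooled predictions gives a single 2x2 confusion matrix over all six samples. Scoring each one-sample fold separately, as in the question, can only ever produce an accuracy of 0 or 1 and a 1x1 or degenerate matrix per fold.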
