I just started using scikit-learn and I feel like I've hit a brick wall. I've used both real-world and test data, but scikit's algorithms can't beat random guessing at predicting anything. I've tried KNN, decision trees, SVC, and naive Bayes.
Basically, I made a test dataset with one column of 0s and 1s, where all the 0s have feature values between 0 and 0.5 and all the 1s have values between 0.5 and 1. This should be trivially easy, with accuracy close to 100%. Yet none of the algorithms performs better than chance: accuracy sits between 45% and 55%. I've tried tuning lots of parameters for every algorithm, but nothing helps, so I suspect something is fundamentally wrong with my implementation.
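For reference, the data in Test.xlsx looks roughly like this synthetic stand-in (the file itself isn't attached, so this is only an approximation of its two columns):

import numpy as np
import pandas as pd

# Hypothetical stand-in for Test.xlsx: 399 rows, a label column of 0s and 1s,
# and a feature column in [0, 0.5) for class 0 and [0.5, 1) for class 1.
rng = np.random.RandomState(0)
labels = rng.randint(0, 2, size=399)
values = np.where(labels == 0,
                  rng.uniform(0.0, 0.5, size=399),
                  rng.uniform(0.5, 1.0, size=399))
df = pd.DataFrame({1: labels, 2: values})  # integer column names to match df[1], df[2] below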
Please help. Here is my code:
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import accuracy_score
import sklearn
import pandas
import numpy as np

df = pandas.read_excel('Test.xlsx')

# Make data into np arrays
y = np.array(df[1])
y = y.astype(float)
y = y.reshape(399)
x = np.array(df[2])
x = x.astype(float)
x = x.reshape(399, 1)

# Creating training and test data
labels_train, labels_test = train_test_split(y)
features_train, features_test = train_test_split(x)

#####################################################################
# PERCEPTRON
#####################################################################
from sklearn import linear_model
perceptron = linear_model.Perceptron()
perceptron.fit(features_train, labels_train)
perc_pred = perceptron.predict(features_test)
print(sklearn.metrics.accuracy_score(labels_test, perc_pred, normalize=True, sample_weight=None))
print('perceptron')

#####################################################################
# KNN classifier
#####################################################################
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
knn.fit(features_train, labels_train)
knn_pred = knn.predict(features_test)
# Accuracy
print(sklearn.metrics.accuracy_score(labels_test, knn_pred, normalize=True, sample_weight=None))
print('knn')

#####################################################################
# SVC
#####################################################################
from sklearn.svm import SVC
svm2 = SVC(kernel="linear")
svm2.fit(features_train, labels_train)
svc_pred = svm2.predict(features_test)
print(sklearn.metrics.accuracy_score(labels_test, svc_pred, normalize=True, sample_weight=None))

#####################################################################
# Decision tree
#####################################################################
from sklearn import tree
clf = tree.DecisionTreeClassifier()
clf = clf.fit(features_train, labels_train)
tree_pred = clf.predict(features_test)
# Accuracy
print(sklearn.metrics.accuracy_score(labels_test, tree_pred, normalize=True, sample_weight=None))
print('tree')

#####################################################################
# Naive bayes
#####################################################################
from time import time
from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()
t0 = time()
clf.fit(features_train, labels_train)
print("training time:", round(time() - t0, 3), "s")
bayes_pred = clf.predict(features_test)
print(sklearn.metrics.accuracy_score(labels_test, bayes_pred, normalize=True, sample_weight=None))
Answer:
You appear to be using train_test_split incorrectly:
labels_train, labels_test = train_test_split(y)        # WRONG
features_train, features_test = train_test_split(x)    # WRONG
Each of those calls shuffles independently, so the split of the labels is not the same as the split of the data: the rows in features_train no longer line up with the entries in labels_train. With the sample-to-label pairing destroyed, every classifier can only score at chance level, which is exactly the 45-55% accuracy you are seeing. A simple way to split the data manually is:
randomvec = np.random.rand(len(data))
randomvec = randomvec > 0.5
train_data = data[randomvec]
train_label = labels[randomvec]
test_data = data[np.logical_not(randomvec)]
test_label = labels[np.logical_not(randomvec)]
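Note that the boolean mask gives only an approximately 50/50 split whose exact sizes vary from run to run; splitting on a shuffled index (e.g. np.random.permutation) gives exact sizes. The key point is that the same mask is applied to both the data and the labels, so the pairing survives.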
Or use the scikit method correctly, passing both arrays in a single call:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.5, random_state=42)
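For completeness, here is a minimal end-to-end sketch on synthetic data shaped like the question describes (features below 0.5 belong to class 0, above 0.5 to class 1; the 399-row size and the KNN choice are simply taken from the question). With one consistent split, the classifier scores near 100% instead of chance:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the question's data: class 0 lives in [0, 0.5),
# class 1 lives in [0.5, 1), 399 rows as in the original script.
rng = np.random.RandomState(42)
y = rng.randint(0, 2, size=399).astype(float)
x = np.where(y == 0,
             rng.uniform(0.0, 0.5, size=399),
             rng.uniform(0.5, 1.0, size=399)).reshape(399, 1)

# A single call keeps every feature row paired with its label.
features_train, features_test, labels_train, labels_test = train_test_split(
    x, y, test_size=0.5, random_state=42)

knn = KNeighborsClassifier()
knn.fit(features_train, labels_train)
print(accuracy_score(labels_test, knn.predict(features_test)))  # ~1.0 on this data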