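I have read a lot about this particular error but haven't found an answer that solves my problem. I have a dataset that I've split into training and test sets, and I want to run a KNeighborsClassifier on it. My code is below. My problem is that when I look at the dtypes of X_train, I don't see any string-typed columns at all. My y_train is a single categorical variable. This is my first Stack Overflow post, so apologies if I've overlooked any etiquette, and thanks for your help! :)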
Error:

    TypeError: unorderable types: str() > float()
Data types:

    X_train.dtypes.value_counts()
    Out[54]:
    int64      2035
    float64     178
    dtype: int64
Code:

    # Import packages
    import os
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.dummy import DummyRegressor
    from sklearn.cross_validation import train_test_split, KFold
    from matplotlib.ticker import FormatStrFormatter
    from sklearn import cross_validation
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.svm import SVC
    import pdb

    # Set directory path
    path = "file_path"
    os.chdir(path)

    # Select file to import
    data = 'RawData2.csv'
    delim = ','

    # Import data file
    df = pd.read_csv(data, sep = delim)
    print (df.head())
    df.columns.get_loc('Categories')

    # Model
    # Select/update features
    X = df[df.columns[14:2215]]

    # Get column index of the target variable
    df.columns.get_loc('Categories')

    # Select target and fill NAs with the "Small" label
    y = y[y.columns[21]]
    print(y.values)
    y.fillna('Small')

    # Train/test sets
    X_sample = X.loc[X.Var1 <1279]
    X_valid = X.loc[X.Var1 > 1278]
    y_sample = y.head(len(X_sample))
    y_test = y.head(len(y)-len(X_sample))
    X_train, X_test, y_train, y_test = train_test_split(X_sample, y_sample, test_size = 0.2)
    cv = KFold(n = X_train.shape[0], n_folds = 5, random_state = 17)
    print(X_train.shape, y_train.shape)
    X_train.dtypes.value_counts()

    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import accuracy_score
    knn = KNeighborsClassifier(n_neighbors = 5)
    knn.fit(X_train, y_train)  # <-- error is flagged here
    accuracy_score(knn.predict(X_test))
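Answer:

Everything in sklearn is built on numpy, and its estimators expect numeric arrays. So categorical variables in both X and y need to be encoded as numbers. For X, you can use get_dummies. For y, you can use LabelEncoder.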
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html
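
A minimal sketch of that encoding, on a small made-up DataFrame rather than the asker's data (the `Colour` column and all values are hypothetical; only `Var1` and `Categories` are names from the question). It also assigns the result of `fillna` back to `y`. In the posted code the return value of `y.fillna('Small')` is discarded, so `y` may still hold `NaN` (a float) next to the string labels, which would be consistent with the `str() > float()` message, though that is an inference from the snippet and not something the question confirms.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data standing in for RawData2.csv.
df = pd.DataFrame({
    "Var1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "Colour": ["red", "blue", "red", "green", "blue", "green"],
    "Categories": ["Large", np.nan, "Large", np.nan, "Small", "Large"],
})

# fillna returns a new Series; assign it, or the NaNs stay in place.
y = df["Categories"].fillna("Small")

# One-hot encode the categorical feature column(s) of X.
X = pd.get_dummies(df[["Var1", "Colour"]], dtype=float)

# Encode the string class labels of y as integers.
encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y)

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y_encoded)

# Map integer predictions back to the original label strings.
predicted = encoder.inverse_transform(knn.predict(X))
```

To check whether a target column mixes types before fitting, `y.map(type).value_counts()` lists the Python types it actually contains.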