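I'm trying to write a classification algorithm for a machine learning model, but I'm getting an error. Can anyone help? Thanks in advance.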
    import pandas as pd
    from sklearn.metrics import accuracy_score
    from scipy.spatial import distance

    def euc(a, b):
        return distance.euclidean(a, b)

    class classifierKN():
        def fit(self, X_train, Y_train):
            self.X_train = X_train
            self.Y_train = Y_train

        def predict(self, X_test):
            predictions = []
            for row in X_test:
                label = self.closest(row)
                predictions.append(label)
            return predictions

        def closest(self, row):
            best_dist = euc(row, self.X_train[0])
            best_index = 0
            for i in range(1, len(self.X_train)):
                dist = euc(row, self.X_train[i])
                if dist < best_dist:
                    best_dist = dist
                    best_index = i
            return self.Y_train[best_index]

    #Load the dataset
    diabetdata = pd.read_csv("diabetes.csv")

    #set features and target
    features = ["PlasmaGlucose", "DiastolicBloodPressure", "TricepsThickness", "SerumInsulin"]
    X = diabetdata[features]
    print("FEATURES: ", X.head())
    Y = diabetdata.Diabetic
    print("TARGET: ", Y.head())
    print("")

    from sklearn.model_selection import train_test_split  #No module named 'sklearn.cross_validation' so I replace it with model_selection
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=0)

    #predict
    model = classifierKN()
    model.fit(X_train, Y_train)
    predictKN = model.predict(X)
    print("Predict result with KNeighborsClassifier")
    print(predictKN)

    #accuracy
    print("Accuracy")
    print(accuracy_score(Y, predictKN))
Result
    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "C:\Users\Vlad\Desktop\Machine learning\Machine Learning\coursework\test2.py", line 63, in <module>
        predictKN = model.predict(X)
      File "C:\Users\Vlad\Desktop\Machine learning\Machine Learning\coursework\test2.py", line 26, in predict
        label = self.closest(row)
      File "C:\Users\Vlad\Desktop\Machine learning\Machine Learning\coursework\test2.py", line 30, in closest
        best_dist = euc(row, self.X_train[0])
      File "E:\Anaconda\lib\site-packages\pandas\core\frame.py", line 2800, in __getitem__
        indexer = self.columns.get_loc(key)
      File "E:\Anaconda\lib\site-packages\pandas\core\indexes\base.py", line 2648, in get_loc
        return self._engine.get_loc(self._maybe_cast_indexer(key))
      File "pandas\_libs\index.pyx", line 111, in pandas._libs.index.IndexEngine.get_loc
      File "pandas\_libs\index.pyx", line 138, in pandas._libs.index.IndexEngine.get_loc
      File "pandas\_libs\hashtable_class_helper.pxi", line 1619, in pandas._libs.hashtable.PyObjectHashTable.get_item
      File "pandas\_libs\hashtable_class_helper.pxi", line 1627, in pandas._libs.hashtable.PyObjectHashTable.get_item
    KeyError: 0
Answer:
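Your code actually has several problems, and it is a bit hard to follow. The main one seems to be a misunderstanding of how pandas DataFrames and Series work, because you are evidently trying to iterate over the rows of a DataFrame like this: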
    def predict(self, X_test):
        predictions = []
        for row in X_test:
            label = self.closest(row)
            predictions.append(label)
        return predictions
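This doesn't work in pandas: iterating over a DataFrame directly yields its column labels, not its rows. To actually iterate over the row values, you need something like this: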
    def predict(self, X_test):
        predictions = []
        # iterrows() yields (index, Series) pairs; row[1] is the row itself
        for row in X_test.iterrows():
            label = self.closest(list(row[1]))
            predictions.append(label)
        return predictions
This version actually iterates over the rows of the DataFrame and passes each row's values to the closest() function.
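To make the difference concrete, here is a minimal sketch. The two-row frame and its values are made up for illustration and are not taken from diabetes.csv.

```python
import pandas as pd

# Hypothetical two-row frame, invented for this illustration
df = pd.DataFrame({"PlasmaGlucose": [171, 92], "SerumInsulin": [23, 36]})

# Iterating the DataFrame itself yields the column labels
labels = [item for item in df]

# iterrows() yields (index, Series) pairs; element [1] is the row
rows = [list(row[1]) for row in df.iterrows()]

print(labels)  # the column names, not row data
print(rows)    # one list of values per row
```

So in the original predict(), each "row" handed to closest() was really a column name string.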
    # closest() as originally posted
    def closest(self, row):
        best_dist = euc(row, self.X_train[0])
        best_index = 0
        for i in range(1, len(self.X_train)):
            dist = euc(row, self.X_train[i])
            if dist < best_dist:
                best_dist = dist
                best_index = i
        return self.Y_train[best_index]
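However, this closest() does not work, because self.X_train[0] in best_dist = euc(row, self.X_train[0]) asks pandas for a column labelled 0. That raises the KeyError, because X_train is a DataFrame with no column named 0 (and you don't want to index a column here anyway). What you want as the default best_dist is the distance between the input row and the first row of the DataFrame, which you get with best_dist = euc(row, self.X_train.iloc[0]). You then need to loop over the rows of X_train (your function has the same problem there), so change it to something like: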
    def closest(self, row):
        # .iloc selects rows by position rather than by label
        best_dist = euc(row, self.X_train.iloc[0])
        best_index = 0
        for i in range(1, len(self.X_train.index)):
            dist = euc(row, list(self.X_train.iloc[i]))
            if dist < best_dist:
                best_dist = dist
                best_index = i
        return self.Y_train.iloc[best_index]
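The reason .iloc matters for Y_train as well can be sketched as follows. The frame, the labels, and the shuffled index [7, 2, 5] are assumptions made up for illustration; they mimic what train_test_split does to the index when it shuffles rows.

```python
import pandas as pd

# Hypothetical data whose index is no longer 0..n-1, as happens to
# X_train and Y_train after train_test_split shuffles the rows
X_train = pd.DataFrame(
    {"PlasmaGlucose": [171, 92, 115], "SerumInsulin": [23, 36, 35]},
    index=[7, 2, 5],
)
Y_train = pd.Series([1, 0, 1], index=[7, 2, 5])

# [] on a DataFrame looks up a column label, so [0] raises KeyError
try:
    X_train[0]
    raised = False
except KeyError:
    raised = True

# .iloc is purely positional: position 0 is the first row, whatever its label
first_row = list(X_train.iloc[0])

# On a Series with an integer index, [] is label-based:
# Y_train[2] is the entry labelled 2, not the third entry
by_label = Y_train[2]
by_position = Y_train.iloc[2]
```

Here by_label and by_position refer to different rows, which is why the revised closest() returns self.Y_train.iloc[best_index] rather than self.Y_train[best_index].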
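With these changes the code at least runs. I can't guarantee that it gives you the output or the accuracy you want, but it does solve your immediate problem.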
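For reference, the pieces can be put together into a runnable sketch. Since diabetes.csv is not available here, the data below is a synthetic stand-in (two well-separated clusters with made-up column names f1 and f2), and the class is renamed ClassifierKN for readability. Predicting on X_test instead of the full X is my own choice, so that accuracy is measured on rows the model was not fitted on.

```python
import numpy as np
import pandas as pd
from scipy.spatial import distance
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


class ClassifierKN:
    """1-nearest-neighbour classifier using positional (.iloc) access."""

    def fit(self, X_train, Y_train):
        self.X_train = X_train
        self.Y_train = Y_train

    def predict(self, X_test):
        # iterrows() yields (index, Series) pairs; keep only the row values
        return [self.closest(list(row)) for _, row in X_test.iterrows()]

    def closest(self, row):
        best_dist = distance.euclidean(row, self.X_train.iloc[0])
        best_index = 0
        for i in range(1, len(self.X_train.index)):
            dist = distance.euclidean(row, self.X_train.iloc[i])
            if dist < best_dist:
                best_dist = dist
                best_index = i
        return self.Y_train.iloc[best_index]


# Synthetic stand-in for diabetes.csv (assumption: two clusters far apart,
# so the expected behaviour is easy to check)
rng = np.random.default_rng(0)
low = rng.normal(loc=0.0, scale=0.5, size=(20, 2))
high = rng.normal(loc=10.0, scale=0.5, size=(20, 2))
X = pd.DataFrame(np.vstack([low, high]), columns=["f1", "f2"])
Y = pd.Series([0] * 20 + [1] * 20, name="Diabetic")

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.3, random_state=0
)

model = ClassifierKN()
model.fit(X_train, Y_train)
predictions = model.predict(X_test)
accuracy = accuracy_score(Y_test, predictions)
print("Accuracy:", accuracy)
```

On real data the row-by-row loop will be slow; it is kept here only because it mirrors the structure of the original code.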