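I read a CSV dataset into a pandas DataFrame and then split the data into training and test sets. I am trying to train and measure accuracy on one feature at a time, so that I can later work out which of the 4 features is the best predictor. I am new to both Python and machine learning, so please bear with me; this is actually my first attempt at either. On the line my_knn_for_cs4661.fit(X_train[col], y_train) I get an error that mentions array.reshape(-1,1). I tried X_train[col].reshape(-1,1), but that gave me some other errors. I am using Python 3 in a Jupyter Notebook, with sklearn, numpy and pandas.
Here are my code and the error: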
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    iris_df = pd.read_csv('https://raw.githubusercontent.com/mpourhoma/CS4661/master/iris.csv')
    feature_cols = ['sepal_length','sepal_width','petal_length','petal_width']
    X = iris_df[feature_cols]
    y = iris_df['species']
    predictions = {}
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=6)
    k = 3
    my_knn_for_cs4661 = KNeighborsClassifier(n_neighbors=k)

    for col in feature_cols:
        # X_train[col] is a 1D Series; this is the call that raises the ValueError
        my_knn_for_cs4661.fit(X_train[col], y_train)
        y_predict = my_knn_for_cs4661.predict(X_test)
        predictions[col] = y_predict
My error:
    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    <ipython-input-41-933eb8b496d8> in <module>()
         13 for col in feature_cols:
         14
    ---> 15     my_knn_for_cs4661.fit(X_train[col], y_train)
         16     y_predict = my_knn_for_cs4661.predict(X_test)
         17     predictions[col] = y_predict

    ~\Anaconda3\lib\site-packages\sklearn\neighbors\base.py in fit(self, X, y)
        763         """
        764         if not isinstance(X, (KDTree, BallTree)):
    --> 765             X, y = check_X_y(X, y, "csr", multi_output=True)
        766
        767         if y.ndim == 1 or y.ndim == 2 and y.shape[1] == 1:

    ~\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_X_y(X, y, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, warn_on_dtype, estimator)
        571     X = check_array(X, accept_sparse, dtype, order, copy, force_all_finite,
        572                     ensure_2d, allow_nd, ensure_min_samples,
    --> 573                     ensure_min_features, warn_on_dtype, estimator)
        574     if multi_output:
        575         y = check_array(y, 'csr', force_all_finite=True, ensure_2d=False,

    ~\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
        439                     "Reshape your data either using array.reshape(-1, 1) if "
        440                     "your data has a single feature or array.reshape(1, -1) "
    --> 441                     "if it contains a single sample.".format(array))
        442             array = np.atleast_2d(array)
        443             # To ensure that array flags are maintained

    ValueError: Expected 2D array, got 1D array instead:
    array=[6.  5.  5.7 6.3 5.6 5.6 4.6 5.8 5.8 4.7 5.5 5.4 5.8 6.4 6.5 6.7 6.1 6.9
     7.2 6.2 5.1 4.9 6.5 6.8 5.1 4.6 5.7 7.9 6.1 6.3 6.8 5.5 6.3 6.7 5.5 5.
     7.3 4.4 5.3 4.8 4.5 4.6 5.  5.8 6.9 4.8 7.7 5.8 5.4 6.7 5.5 6.7 5.9 5.6
     5.  6.  5.9 7.  5.4 4.9 5.  5.2 6.  5.1 6.1 6.2 5.6 6.7 6.8 5.8 6.7 5.7
     7.2 5.4 7.4 4.4 6.2 6.5 5.  6.7 6.6 4.9 5.  6.  5.5 6.2 5.7 7.2 4.9 6. ].
    Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
Answer:
I found a solution, although it looks a bit hacky and I am not sure whether it is the Pythonic way to do it.
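    import pandas as pd
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    iris_df = pd.read_csv('https://raw.githubusercontent.com/mpourhoma/CS4661/master/iris.csv')
    feature_cols = ['sepal_length','sepal_width','petal_length','petal_width']
    X = iris_df[feature_cols]
    y = iris_df['species']
    predictions = {}
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=6)
    k = 3
    my_knn_for_cs4661 = KNeighborsClassifier(n_neighbors=k)

    for col in feature_cols:
        # .values gives a 1D NumPy array; reshape(-1, 1) turns it into a single column
        my_knn_for_cs4661.fit(X_train[col].values.reshape(-1,1), y_train)
        # predict on the same single feature the model was fitted on
        y_predict = my_knn_for_cs4661.predict(X_test[col].values.reshape(-1,1))
        # store the accuracy of this one-feature model
        predictions[col] = accuracy_score(y_test, y_predict)

    print(predictions)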
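
A possibly tidier alternative, offered as a sketch rather than as the definitive fix: selecting the column with double brackets, X_train[[col]], returns a one-column DataFrame, which is already 2D, so no reshape is needed. The example below assumes scikit-learn's bundled iris data (load_iris with as_frame=True) in place of the CSV above so that it runs without a download. As a result its column names differ from the ones in the question, and knn, accuracies and best_feature are illustrative names that do not appear in the original code.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: the bundled iris data stands in for the CSV used in the question,
# so the columns are named e.g. 'petal length (cm)' instead of 'petal_length'.
iris = load_iris(as_frame=True)
X, y = iris.data, iris.target
feature_cols = list(X.columns)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=6
)

col = feature_cols[0]
# Single brackets give a 1D Series; double brackets give a 2D DataFrame.
series_shape = X_train[col].shape                           # 1D: (n_samples,)
frame_shape = X_train[[col]].shape                          # 2D: (n_samples, 1)
reshaped_shape = X_train[col].values.reshape(-1, 1).shape   # 2D: (n_samples, 1)
print(series_shape, frame_shape, reshaped_shape)

accuracies = {}
for col in feature_cols:
    # A fresh classifier per feature keeps the single-feature models independent.
    knn = KNeighborsClassifier(n_neighbors=3)
    knn.fit(X_train[[col]], y_train)
    # Predict on the same single column the model was fitted on.
    y_predict = knn.predict(X_test[[col]])
    accuracies[col] = accuracy_score(y_test, y_predict)

# The feature whose one-column model scored highest on the test set.
best_feature = max(accuracies, key=accuracies.get)
print(accuracies)
print(best_feature)
```

Both forms hand the classifier the same single column of numbers, so the choice is mostly about readability. The double-bracket form also keeps the column name attached to the data.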