I have two datasets and apply five different machine learning models to each of them.
Dataset 1:
    def dataset_1():
        ...
        ...
        bike_data_hours = bike_data_hours[:500]
        X = bike_data_hours.iloc[:, :-1].values
        y = bike_data_hours.iloc[:, -1].values
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
        return X_train, X_test, y_train.reshape(-1, 1), y_test.reshape(-1, 1)
The shapes are (400, 14) (100, 14) (400, 1) (100, 1), and the data types are dtypes: object (int64, float64).
Dataset 2:
    def dataset_2():
        ...
        ...
        final_movie_df = final_movie_df[:500]
        X = final_movie_df.iloc[:, :-1]
        y = final_movie_df.iloc[:, -1]
        gs = GroupShuffleSplit(n_splits=2, test_size=0.2)
        train_ix, test_ix = next(gs.split(X, y, groups=X.UserID))
        X_train = X.iloc[train_ix]
        y_train = y.iloc[train_ix]
        X_test = X.iloc[test_ix]
        y_test = y.iloc[test_ix]
        return X_train.shape, X_test.shape, y_train.values.reshape(-1,1).shape, y_test.values.reshape(-1,1).shape
The shapes are (400, 25) (100, 25) (400, 1) (100, 1), and the data types are dtypes: object (int64, float64).
I use different models. The code is as follows:
    X_train, X_test, y_train, y_test = dataset
    fold_residuals, fold_dfs = [], []
    kf = KFold(n_splits=k, shuffle=True)
    for train_index, _ in kf.split(X_train):
        if reg_name == "RF" or reg_name == "SVR":
            preds = regressor.fit(X_train[train_index], y_train[train_index].ravel()).predict(X_test)
        elif reg_name == "Knn-5":
            preds = regressor.fit(X_train[train_index], np.ravel(y_train[train_index], order="C")).predict(X_test)
        else:
            preds = regressor.fit(X_train[train_index], y_train[train_index]).predict(X_test)
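However, I get a common error, similar to this one, this one, and this one. I have gone through all of those posts but still cannot work out what causes the error. I have already used iloc and values, which are the solutions given in those links.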
        preds = regressor.fit(X_train[train_index], y_train[train_index]).predict(X_test)
      File "/home/fgd/.local/lib/python3.8/site-packages/pandas/core/frame.py", line 3030, in __getitem__
        indexer = self.loc._get_listlike_indexer(key, axis=1, raise_missing=True)[1]
      File "/home/fgd/.local/lib/python3.8/site-packages/pandas/core/indexing.py", line 1266, in _get_listlike_indexer
        self._validate_read_indexer(keyarr, indexer, axis, raise_missing=raise_missing)
      File "/home/fgd/.local/lib/python3.8/site-packages/pandas/core/indexing.py", line 1308, in _validate_read_indexer
        raise KeyError(f"None of [{key}] are in the [{axis_name}]")
    KeyError: "None of [Int64Index([ 0, 1, 3, 4, 5, 6, 7, 9, 10, 11,\n ...\n 387, 388, 389, 390, 391, 392, 393, 395, 397, 399],\n dtype='int64', length=320)] are in the [columns]"
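If I use train_test_split instead of GroupShuffleSplit, the code works. However, I want to use GroupShuffleSplit based on UserID so that the same user is never assigned to both the training set and the test set. Can you tell me how to solve this problem while still using GroupShuffleSplit?

Can you also tell me why dataset_2 raises the error while dataset_1 runs without any problem (the shape and dtypes are the same for both datasets)?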
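Answer:
You need to use values for dataset 2 as well. In dataset_1, X and y are converted to NumPy arrays with .values before the split, so X_train[train_index] selects rows by position. In dataset_2, X_train is still a pandas DataFrame, and indexing a DataFrame with [] and an array of integers looks those integers up as column labels, which is why the KeyError reports that none of them are in the [columns]. The shapes and dtypes match, but the container types do not. Make the following changes: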
    X_train = X.iloc[train_ix].values
    y_train = y.iloc[train_ix].values
    X_test = X.iloc[test_ix].values
    y_test = y.iloc[test_ix].values
    return X_train.shape, X_test.shape, y_train.reshape(-1,1).shape, y_test.reshape(-1,1).shape
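To see why the container type matters even when shape and dtypes agree, here is a minimal sketch on a toy frame. The column names and values are made up for illustration and are not taken from the question's data; only the indexing behaviour is the point.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; column names and values are illustrative only.
df = pd.DataFrame({"UserID": [1, 1, 2, 3], "Rating": [4.0, 3.5, 5.0, 2.0]})
rows = np.array([0, 2])  # positional row indices, like KFold's train_index

# NumPy array: [] with an integer array selects rows by position.
arr_rows = df.values[rows]

# DataFrame: [] with an integer array is treated as a list of column labels,
# so pandas raises KeyError because no column is named 0 or 2.
try:
    df[rows]
    raised = False
except KeyError:
    raised = True

# Positional row selection on a DataFrame has to go through .iloc.
iloc_rows = df.iloc[rows]
```

So an alternative to converting with values would be to keep the DataFrame and write X_train.iloc[train_index] inside the loop, but that would then break dataset_1, which already returns arrays. Converting dataset_2 to arrays keeps one loop working for both.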
Hope it works now.
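For completeness, here is a rough end-to-end sketch of the fixed pipeline. It assumes a synthetic stand-in for final_movie_df (20 users with 10 rows each, one made-up feature column) and uses LinearRegression as a placeholder for the question's regressors. It also returns the arrays themselves rather than their shapes, since the loop needs the data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GroupShuffleSplit, KFold

# Synthetic stand-in for final_movie_df (assumption): 20 users, 10 rows each.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "UserID": np.repeat(np.arange(20), 10),
    "feature": rng.normal(size=200),
    "Rating": rng.uniform(1, 5, size=200),
})
X, y = df.iloc[:, :-1], df.iloc[:, -1]

# Split by group so that no UserID lands on both sides.
gs = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_ix, test_ix = next(gs.split(X, y, groups=X.UserID))

# Convert to NumPy so that X_train[train_index] later selects rows.
X_train, X_test = X.iloc[train_ix].values, X.iloc[test_ix].values
y_train = y.iloc[train_ix].values.reshape(-1, 1)
y_test = y.iloc[test_ix].values.reshape(-1, 1)

# Same indexing pattern as the question's loop, now applied to arrays.
fold_preds = []
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train_index, _ in kf.split(X_train):
    model = LinearRegression()  # placeholder for the question's regressors
    preds = model.fit(X_train[train_index], y_train[train_index]).predict(X_test)
    fold_preds.append(preds)

# UserID is column 0 of X; the two sides should share no users.
shared_users = set(X_train[:, 0]) & set(X_test[:, 0])
```

Here shared_users should come out empty, which is the property GroupShuffleSplit is there to provide. Note that the inner KFold still splits rows without regard to UserID; that matches the question's loop and is left unchanged in this sketch.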