I have a simple dataset. First I split it with train_test_split(), and then I tried to use KFold() on the result. The code looks like this:
from typing import Dict

import numpy as np
from numpy import ndarray
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, train_test_split


def call(
    X_train: ndarray,
    X_test: ndarray,
    y_train: ndarray,
    y_test: ndarray,
    k: int,
    repetitions: int,
) -> Dict:
    rep_sub = []
    for reps in range(repetitions):
        fold_sub = []
        kf = KFold(n_splits=k, shuffle=True)
        for train_index, test_index in kf.split(X_train):
            preds = LinearRegression().fit(
                X_train[train_index], y_train[train_index]
            ).predict(X_test[test_index])
            sub = preds - y_test[test_index]
            fold_sub.extend(sub)
        rep_sub.extend(fold_sub)
    return rep_sub


if __name__ == "__main__":
    X = np.array([[1, 2], [1, 2], [1, 2], [1, 2], [1, 2], [3, 4],
                  [1, 2], [1, 2], [1, 2], [1, 2], [1, 2], [3, 4]])
    y = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    all_preds = call(X_train, X_test, y_train, y_test, k=2, repetitions=2)
I get the error IndexError: index 4 is out of bounds for axis 0 with size 3
Can you explain what I am doing wrong here? I need to use 5 fold holdout external validation!
Answer:
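KFold(...).split(X) creates k pairs of (train, test) index subsets from the X you pass in, so those indices are only valid for indexing that same X. What you are trying to do is apply the test indices to something that is not X: the folds are built from X_train (9 rows after the 80/20 split of 12 samples), but test_index is then used on X_test and y_test, which have only 3 rows, hence the IndexError. Leaving aside why you want this odd combination of train_test_split and KFold, you should either use test_index to index the X_train you passed in, or ignore it altogether. Here are two ways you could use it (again, without commenting on why you would do it this way):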
Case 1
preds = LinearRegression().fit(
    X_train[train_index], y_train[train_index]
).predict(X_train[test_index])
sub = preds - y_train[test_index]
Case 2
preds = LinearRegression().fit(
    X_train[train_index], y_train[train_index]
).predict(X_test)
sub = preds - y_test
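For reference, here is a minimal runnable sketch of both cases on the data from the question. The function names cv_residuals and external_residuals and the seed parameter are my own, added for illustration and reproducibility; they are not part of the original code. It also checks that the fold indices range over the rows of X_train, which is why they cannot be used on X_test.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, train_test_split


def cv_residuals(X_train, y_train, k, seed=0):
    """Case 1: score each fold's model on the held-out part of X_train."""
    residuals = []
    kf = KFold(n_splits=k, shuffle=True, random_state=seed)
    for train_index, test_index in kf.split(X_train):
        model = LinearRegression().fit(X_train[train_index], y_train[train_index])
        preds = model.predict(X_train[test_index])
        residuals.extend(preds - y_train[test_index])
    return np.array(residuals)


def external_residuals(X_train, y_train, X_test, y_test, k, seed=0):
    """Case 2: score each fold's model on the whole external test set."""
    residuals = []
    kf = KFold(n_splits=k, shuffle=True, random_state=seed)
    # test_index is deliberately ignored here
    for train_index, _ in kf.split(X_train):
        model = LinearRegression().fit(X_train[train_index], y_train[train_index])
        residuals.extend(model.predict(X_test) - y_test)
    return np.array(residuals)


# Same data as in the question
X = np.array([[1, 2]] * 5 + [[3, 4]] + [[1, 2]] * 5 + [[3, 4]])
y = np.arange(1, 13)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fold indices refer to rows of X_train, so they can exceed len(X_test)
max_fold_index = max(idx.max() for _, idx in KFold(n_splits=2).split(X_train))

case1 = cv_residuals(X_train, y_train, k=2)
case2 = external_residuals(X_train, y_train, X_test, y_test, k=2)
```

Case 1 produces one residual per training row, because each row lands in exactly one held-out fold. Case 2 produces k residuals per test row, one from each fold's model.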