I want to split my dataset without using the sklearn library. Below is what I have so far.
My current code:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1234)
What I have tried:
def non_shuffling_train_test_split(X, y, test_size=0.2):
    i = int((1 - test_size) * X.shape[0]) + 1
    X_train, X_test = np.split(X, [i])
    y_train, y_test = np.split(y, [i])
    return X_train, X_test, y_train, y_test
However, the code above does not shuffle the data before splitting.
Answer:
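You can use np.random.permutation to create a shuffled order of row indices, then use np.take to select rows in that order. This works for both numpy arrays and pandas DataFrames: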
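import numpy as np

def tt_split(X, y, test_size=0.2):
    # Number of rows that go into the training set
    i = int((1 - test_size) * X.shape[0])
    # Random ordering of the row indices
    o = np.random.permutation(X.shape[0])
    # Reorder X and y with the same permutation, then cut at position i
    X_train, X_test = np.split(np.take(X, o, axis=0), [i])
    y_train, y_test = np.split(np.take(y, o), [i])
    return X_train, X_test, y_train, y_test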
Testing on numpy arrays:
X = np.random.normal(0, 1, (50, 10))
y = np.random.normal(0, 1, (50,))
X_train, X_test, y_train, y_test = tt_split(X, y)
[X_train.shape, y_train.shape]
# [(40, 10), (40,)]
Testing on pandas DataFrames:
import pandas as pd

X = pd.DataFrame(np.random.normal(0, 1, (50, 10)))
y = pd.Series(np.random.normal(0, 1, 50))
X_train, X_test, y_train, y_test = tt_split(X, y)
[X_train.shape, y_train.shape]
# [(40, 10), (40,)]
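The sklearn call in the question passes random_state=1234, while the function above draws a different permutation on every run. If a repeatable split is needed, one possible variant is sketched below. It assumes NumPy's Generator API (np.random.default_rng); the name seeded_split, the seed parameter, and the rows helper are hypothetical and introduced only for illustration. It indexes rows directly instead of going through np.take and np.split, using .iloc when the input is a pandas object.

```python
import numpy as np

def seeded_split(X, y, test_size=0.2, seed=None):
    """Shuffle row positions once, then slice them into train and test parts."""
    n = len(X)
    n_train = int((1 - test_size) * n)
    # Assumption: seed plays a role similar to sklearn's random_state
    rng = np.random.default_rng(seed)
    order = rng.permutation(n)
    train_idx, test_idx = order[:n_train], order[n_train:]

    def rows(data, idx):
        # pandas objects are indexed by position via .iloc; arrays index directly
        return data.iloc[idx] if hasattr(data, "iloc") else data[idx]

    return (rows(X, train_idx), rows(X, test_idx),
            rows(y, train_idx), rows(y, test_idx))
```

Calling seeded_split(X, y, seed=1234) twice should return the same rows both times. It is not expected to reproduce the exact split that train_test_split produces with random_state=1234, since the two use different random number generators.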