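The response values in my dataset are imbalanced: far more samples are classified as declined than as non-declined, so I want to balance my dataset.
To do this, I previously used code based on the now-deprecated cross_validation.StratifiedKFold. I now need to adapt it, but I don't fully understand it, so I'm asking for help.
The original code is: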
def stratified_cv(X, y, clf_class, shuffle=True, n_folds=10, **kwargs):
    stratified_k_fold = cross_validation.StratifiedKFold(y, n_folds=n_folds, shuffle=shuffle)
    y_pred = y.copy()
    # ii -> train set indices
    # jj -> test set indices
    for ii, jj in stratified_k_fold:
        X_train, X_test = X[ii], X[jj]
        y_train = y[ii]
        clf = clf_class(**kwargs)
        clf.fit(X_train, y_train)
        y_pred[jj] = clf.predict(X_test)
    return y_pred
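Here X is the dataset after fit_transform, converted to a NumPy float array and scaled, and y is the declined / non-declined class, converted to an integer array (0 or 1, of course). Finally, clf_class is a classifier class such as ensemble.GradientBoostingClassifier, svm.SVC, or ensemble.RandomForestClassifier, which is instantiated as clf_class(**kwargs).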
X = np.array([[-0.6786493 , 0.67648946, -0.52360328, -0.32758048, 1.6170861 , 1.23488274, 1.56676695, 0.47664315, 1.56703625, -0.07060962, -0.05594035, -0.07042665, 0.86674322, -0.46549436, 0.86602851, -0.08500823, -0.60119509, -0.0856905 , -0.42793202],[0.6031696 , 0.14906505, -0.52360328, -0.32758048, 1.6170861 , 1.30794844, -0.33373776, 1.12450284, -0.33401297, -0.10808036, 0.14486653, -0.10754944, 1.05857074, 0.14782467, 1.05938994, 1.24048169, -0.60119509, 1.2411686 , -0.42793202],[ 0.33331299, 0.9025285 , -0.52360328, -0.32758048, -0.61839626, -0.59175986, 1.16830364, 0.67598459, 1.168464 , -1.57338336, 0.49627857, -1.57389963, -0.75686906, 0.19893459, -0.75557074, 0.70312091, 0.21153386, 0.69715637, -1.1882185 ],[ 0.6031696 , -0.42859027, -0.68883427, 3.05268496, -0.61839626, -0.59175986, 2.19659605, -1.46693591, 2.19675881, -2.74286476, -0.60815927, -2.7432675 , -0.07855114, -0.5677142 , -0.07880574, -1.30302599, 1.02426282, -1.30640087, 0.33235445],[ 0.67063375, -0.6546293 , -0.52360328, 3.05268496, -0.61839626, -0.59175986, -0.24008971, 0.62614923, -0.24004065, -1.03893233, 1.0986992 , -1.03793936, -0.27631146, 1.06780322, -0.27656174, -0.04918418, -0.60119509, -0.04588472, 1.09264093],[-0.74611345, -0.90578379, -0.52360328, -0.32758048, -0.61839626, -0.59175986, -0.93051461, 1.82219789, -0.93025113, 0.54272717, -0.85916786, 0.54209937, 0.15678365, 0.55670403, 0.15850147, 0.88224117, 0.61789834, 0.88291665, 1.8529274 ],[ 0.53570545, 1.50529926, -0.52360328, -0.32758048, -0.61839626, -0.59175986, 2.81173526, -1.66627735, 2.81135938, 2.30385178, -0.15634379, 2.3031117 , -0.79642112, 1.42557266, -0.79512194, -1.73291462, 1.83699177, -1.73099578, 1.8529274 ]])
y = np.array([0,0,0,0,0,1,1])
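Answer:
StratifiedKFold has moved to model_selection, and its interface changed along the way: the constructor no longer takes y, n_folds was renamed to n_splits, and the train/test indices now come from calling split(X, y) instead of iterating over the object itself. So you should do this: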
from sklearn.model_selection import StratifiedKFold

def stratified_cv(X, y, clf_class, shuffle=True, n_folds=10, **kwargs):
    # y is no longer passed to the constructor; it goes to split() instead
    stratified_k_fold = StratifiedKFold(n_splits=n_folds, shuffle=shuffle)
    y_pred = y.copy()
    # ii -> train set indices
    # jj -> test set indices
    for ii, jj in stratified_k_fold.split(X, y):
        X_train, X_test = X[ii], X[jj]
        y_train = y[ii]
        # a fresh classifier is trained for every fold
        clf = clf_class(**kwargs)
        clf.fit(X_train, y_train)
        y_pred[jj] = clf.predict(X_test)
    return y_pred
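As a rough usage sketch of the function above, the snippet below runs it on synthetic data and checks that every test fold keeps the class ratio of the full label array. The names X_demo, y_demo, and fold_counts, the 40/10 class split, and the choice of RandomForestClassifier with n_estimators=10 are assumptions made for illustration only; they are not taken from the question's dataset. Note that n_splits cannot exceed the number of samples, so with the seven-row sample from the question n_folds would have to be lowered (for example to 2).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_predict


def stratified_cv(X, y, clf_class, shuffle=True, n_folds=10, **kwargs):
    """Return out-of-fold predictions using stratified k-fold splits."""
    stratified_k_fold = StratifiedKFold(n_splits=n_folds, shuffle=shuffle)
    y_pred = y.copy()
    for ii, jj in stratified_k_fold.split(X, y):
        clf = clf_class(**kwargs)
        clf.fit(X[ii], y[ii])
        y_pred[jj] = clf.predict(X[jj])
    return y_pred


# Synthetic, illustrative data (assumption): 40 samples of class 0, 10 of class 1.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 4))
y_demo = np.array([0] * 40 + [1] * 10)

# Stratification keeps the 4:1 class ratio inside every test fold.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_counts = [
    np.bincount(y_demo[test_idx], minlength=2).tolist()
    for _, test_idx in skf.split(X_demo, y_demo)
]
print(fold_counts)  # [[8, 2], [8, 2], [8, 2], [8, 2], [8, 2]]

# Out-of-fold predictions from the hand-written loop.
y_pred = stratified_cv(
    X_demo, y_demo, RandomForestClassifier,
    n_folds=5, n_estimators=10, random_state=0,
)

# scikit-learn's cross_val_predict produces the same kind of output
# (one out-of-fold prediction per sample) without the manual loop.
y_pred_builtin = cross_val_predict(
    RandomForestClassifier(n_estimators=10, random_state=0),
    X_demo, y_demo, cv=skf,
)
```

Stratification only preserves the class proportions in each fold; it does not resample or reweight the classes, so the folds here stay at 4:1 just like the full array.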