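I have just started learning CatBoost and am trying to use StratifiedKFold together with CatBoostRegressor, but I run into an error.
This is the edited post, with the full code block and error message included for clarity. I also tried for i, (train_index, test_index) in enumerate(fold.split(X,y)): but that did not work either.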

    from sklearn.model_selection import KFold,StratifiedKFold
    from sklearn.metrics import mean_squared_log_error
    from sklearn.preprocessing import LabelEncoder
    from catboost import Pool, CatBoostRegressor

    fold=StratifiedKFold(n_splits=5,shuffle=True,random_state=42)
    err = []
    y_pred = []
    for train_index, test_index in fold.split(X,y):
    #for i, (train_index, test_index) in enumerate(fold.split(X,y)):
        X_train, X_val = X.iloc[train_index], X.iloc[test_index]
        y_train, y_val = y[train_index], y[test_index]
        _train = Pool(X_train, label = y_train)
        _valid = Pool(X_val, label = y_val)
        cb = CatBoostRegressor(n_estimators = 20000,
                               reg_lambda = 1.0,
                               eval_metric = 'RMSE',
                               random_seed = 42,
                               learning_rate = 0.01,
                               od_type = "Iter",
                               early_stopping_rounds = 2000,
                               depth = 7,
                               cat_features = cate,
                               bagging_temperature = 1.0)
        cb.fit(_train,cat_features=cate,eval_set = _valid, early_stopping_rounds = 2000,
               use_best_model = True, verbose_eval = 100)
        p = cb.predict(X_val)
        print("err: ",rmsle(y_val,p))
        err.append(rmsle(y_val,p))
        pred = cb.predict(test_df)
        y_pred.append(pred)
    predictions = np.mean(y_pred,0)

    ValueError                                Traceback (most recent call last)
    <ipython-input-21-3a0df0c7b8d6> in <module>()
          7 err = []
          8 y_pred = []
    ----> 9 for train_index, test_index in fold.split(X,y):
         10 #for i, (train_index, test_index) in enumerate(fold.split(X,y)):
         11     X_train, X_val = X.iloc[train_index], X.iloc[test_index]

    ~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sklearn/model_selection/_split.py in split(self, X, y, groups)
        333                 .format(self.n_splits, n_samples))
        334
    --> 335         for train, test in super().split(X, y, groups):
        336             yield train, test
        337

    ~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sklearn/model_selection/_split.py in split(self, X, y, groups)
         87         X, y, groups = indexable(X, y, groups)
         88         indices = np.arange(_num_samples(X))
    ---> 89         for test_index in self._iter_test_masks(X, y, groups):
         90             train_index = indices[np.logical_not(test_index)]
         91             test_index = indices[test_index]

    ~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sklearn/model_selection/_split.py in _iter_test_masks(self, X, y, groups)
        684
        685     def _iter_test_masks(self, X, y=None, groups=None):
    --> 686         test_folds = self._make_test_folds(X, y)
        687         for i in range(self.n_splits):
        688             yield test_folds == i

    ~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sklearn/model_selection/_split.py in _make_test_folds(self, X, y)
        639             raise ValueError(
        640                 'Supported target types are: {}. Got {!r} instead.'.format(
    --> 641                     allowed_target_types, type_of_target_y))
        642
        643         y = column_or_1d(y)

    ValueError: Supported target types are: ('binary', 'multiclass'). Got 'continuous' instead.

Answer:
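The reason you get this error is a very basic one, rooted in fundamental machine learning theory: stratification is defined only for classification, where it ensures that every class is represented in the same proportion in each split; it is meaningless in regression. If you read the error message carefully, you should be able to convince yourself that this is exactly what it says: 'continuous' targets (i.e. regression) are not supported, only 'binary' or 'multiclass' ones (i.e. classification). This is not a scikit-learn peculiarity; it is a genuinely fundamental issue.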
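
As a quick check of the terminology in that error message, the sketch below uses scikit-learn's type_of_target helper to see how a target array is classified. The toy arrays are made up for illustration and are not the asker's data.

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target

# Toy targets, invented for illustration only
continuous = type_of_target(np.array([0.1, 0.5, -1.1, 1.2]))  # real-valued: regression
binary = type_of_target(np.array([0, 1, 0, 1]))               # two classes
multiclass = type_of_target(np.array([0, 1, 2, 1]))           # more than two classes

print(continuous, binary, multiclass)
```

Running the same helper on your own y is an easy way to see how StratifiedKFold will interpret it before calling split.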
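A relevant hint is also included in the documentation (note the reference to classes in the last sentence):
Stratified K-Folds cross-validator
Provides train/test indices to split data in train/test sets.
This cross-validation object is a variation of KFold that returns stratified folds. The folds are made by preserving the percentage of samples for each class.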
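
To illustrate what preserving the percentage of samples per class means, here is a minimal sketch on a made-up classification target with 8 samples of class 0 and 4 of class 1. The data and variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical classification data: 8 samples of class 0, 4 samples of class 1
X = np.zeros((12, 1))
y = np.array([0] * 8 + [1] * 4)

skf = StratifiedKFold(n_splits=4)
ratios = []
for train_index, test_index in skf.split(X, y):
    # Fraction of class 1 in this test fold
    ratios.append(float(y[test_index].mean()))

print(ratios)
```

Each test fold should contain class 1 in the same proportion as the full dataset (one third here). With a continuous target there are no classes whose proportions could be preserved, which is why the splitter refuses it.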
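Here is a short demonstration, adapted from the example in the documentation, but with the target y changed from discrete (classification) to continuous (regression):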

    import numpy as np
    from sklearn.model_selection import StratifiedKFold

    X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
    y = np.array([0.1, 0.5, -1.1, 1.2])  # continuous target, i.e. a regression problem
    skf = StratifiedKFold(n_splits=2)

    for train_index, test_index in skf.split(X,y):
        print("something")

    [...]
    ValueError: Supported target types are: ('binary', 'multiclass'). Got 'continuous' instead.

So, simply put, you cannot use StratifiedKFold in your (regression) setting; change it to a plain KFold and carry on...
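
A minimal sketch of that swap, using synthetic NumPy data in place of the asker's X and y (which are not shown in the question), could look like this:

```python
import numpy as np
from sklearn.model_selection import KFold

# Synthetic regression data, invented for illustration only
rng = np.random.default_rng(42)
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)  # continuous target

# Plain KFold does not inspect the target type, so continuous y is fine
kf = KFold(n_splits=5, shuffle=True, random_state=42)
fold_sizes = []
for fold_idx, (train_index, test_index) in enumerate(kf.split(X, y)):
    X_train, X_val = X[train_index], X[test_index]
    y_train, y_val = y[train_index], y[test_index]
    fold_sizes.append((len(train_index), len(test_index)))
    print(f"fold {fold_idx}: train={len(train_index)}, val={len(test_index)}")
```

In the loop from the question, only the splitter changes; the CatBoost part can stay as it is. If y is a pandas Series, indexing it with y.iloc[train_index] is the safer choice, since y[train_index] may be treated as label-based lookup when the index is not the default range.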