Problem with feature selection on an imbalanced dataset

I am doing feature selection on an imbalanced dataset (54:38:7%) using RFECV, as follows:

# Imports for the snippet below
from lightgbm import LGBMClassifier
from sklearn.feature_selection import RFECV
from sklearn.metrics import log_loss, make_scorer

# Make a multiclass log-loss scorer
log_loss_rfe = make_scorer(score_func=log_loss, greater_is_better=False)
# Initialise the LightGBM classifier
lgb_rfe = LGBMClassifier(objective='multiclass', learning_rate=0.01, verbose=0, force_col_wise=True,
                         random_state=100, n_estimators=5_000, n_jobs=7)
# Initialise RFECV
rfe = RFECV(estimator=lgb_rfe, min_features_to_select=2, verbose=3, n_jobs=2, cv=3, scoring=log_loss_rfe)
# Fit
rfe.fit(X=X_train, y=y_train)

I run into an error, presumably because the subsamples that sklearn's RFECV draws do not contain all of the classes present in my data. Fitting on the same data outside of RFECV works without any problem.
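To double-check that hypothesis, the CV splits can be inspected directly. A minimal sketch (assuming y_train holds the three integer labels; RFECV with an integer cv and a classifier uses StratifiedKFold, so every fold would normally contain all classes):

# Print the classes present in each stratified test split
import numpy as np
from sklearn.model_selection import StratifiedKFold

cv = StratifiedKFold(n_splits=3)
for fold, (train_idx, test_idx) in enumerate(cv.split(X_train, y_train)):
    print(f"fold {fold}: classes in test split:", np.unique(np.asarray(y_train)[test_idx]))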

Here is the full error message:

---------------------------------------------------------------------------
_RemoteTraceback                          Traceback (most recent call last)
_RemoteTraceback: """Traceback (most recent call last):
  File "/home/ubuntu/ds_jup_venv/lib/python3.8/site-packages/joblib/externals/loky/process_executor.py", line 431, in _process_worker
    r = call_item()
  File "/home/ubuntu/ds_jup_venv/lib/python3.8/site-packages/joblib/externals/loky/process_executor.py", line 285, in __call__
    return self.fn(*self.args, **self.kwargs)
  File "/home/ubuntu/ds_jup_venv/lib/python3.8/site-packages/joblib/_parallel_backends.py", line 595, in __call__
    return self.func(*args, **kwargs)
  File "/home/ubuntu/ds_jup_venv/lib/python3.8/site-packages/joblib/parallel.py", line 262, in __call__
    return [func(*args, **kwargs)
  File "/home/ubuntu/ds_jup_venv/lib/python3.8/site-packages/joblib/parallel.py", line 262, in <listcomp>
    return [func(*args, **kwargs)
  File "/home/ubuntu/ds_jup_venv/lib/python3.8/site-packages/sklearn/utils/fixes.py", line 222, in __call__
    return self.function(*args, **kwargs)
  File "/home/ubuntu/ds_jup_venv/lib/python3.8/site-packages/sklearn/feature_selection/_rfe.py", line 37, in _rfe_single_fit
    return rfe._fit(
  File "/home/ubuntu/ds_jup_venv/lib/python3.8/site-packages/sklearn/feature_selection/_rfe.py", line 259, in _fit
    self.scores_.append(step_score(estimator, features))
  File "/home/ubuntu/ds_jup_venv/lib/python3.8/site-packages/sklearn/feature_selection/_rfe.py", line 39, in <lambda>
    lambda estimator, features: _score(
  File "/home/ubuntu/ds_jup_venv/lib/python3.8/site-packages/sklearn/model_selection/_validation.py", line 674, in _score
    scores = scorer(estimator, X_test, y_test)
  File "/home/ubuntu/ds_jup_venv/lib/python3.8/site-packages/sklearn/metrics/_scorer.py", line 199, in __call__
    return self._score(partial(_cached_call, None), estimator, X, y_true,
  File "/home/ubuntu/ds_jup_venv/lib/python3.8/site-packages/sklearn/metrics/_scorer.py", line 242, in _score
    return self._sign * self._score_func(y_true, y_pred,
  File "/home/ubuntu/ds_jup_venv/lib/python3.8/site-packages/sklearn/utils/validation.py", line 63, in inner_f
    return f(*args, **kwargs)
  File "/home/ubuntu/ds_jup_venv/lib/python3.8/site-packages/sklearn/metrics/_classification.py", line 2265, in log_loss
    raise ValueError("y_true and y_pred contain different number of "
ValueError: y_true and y_pred contain different number of classes 3, 2. Please provide the true labels explicitly through the labels argument. Classes found in y_true: [0 1 2]"""

The above exception was the direct cause of the following exception:
ValueError                                Traceback (most recent call last)
<ipython-input-9-5feb62a6f457> in <module>
      1 rfe = RFECV(estimator=lgb_rfe, min_features_to_select=2, verbose=3, n_jobs=2, cv=3, scoring=log_loss_rfe)
----> 2 rfe.fit(X=X_train, y=y_train)
~/ds_jup_venv/lib/python3.8/site-packages/sklearn/feature_selection/_rfe.py in fit(self, X, y, groups)
    603             func = delayed(_rfe_single_fit)
    604 --> 605         scores = parallel(
    606             func(rfe, self.estimator, X, y, train, test, scorer)
    607             for train, test in cv.split(X, y, groups))
~/ds_jup_venv/lib/python3.8/site-packages/joblib/parallel.py in __call__(self, iterable)
   1052
   1053             with self._backend.retrieval_context():
-> 1054                 self.retrieve()
   1055             # Make sure that we get a last message telling us we are done
   1056             elapsed_time = time.time() - self._start_time
~/ds_jup_venv/lib/python3.8/site-packages/joblib/parallel.py in retrieve(self)
    931             try:
    932                 if getattr(self._backend, 'supports_timeout', False):
--> 933                     self._output.extend(job.get(timeout=self.timeout))
    934                 else:
    935                     self._output.extend(job.get())
~/ds_jup_venv/lib/python3.8/site-packages/joblib/_parallel_backends.py in wrap_future_result(future, timeout)
    540         AsyncResults.get from multiprocessing."""
    541         try:
--> 542             return future.result(timeout=timeout)
    543         except CfTimeoutError as e:
    544             raise TimeoutError from e
/usr/lib/python3.8/concurrent/futures/_base.py in __get_result(self)
    386     def __get_result(self):
    387         if self._exception:
--> 388             raise self._exception
    389         else:
    390             return self._result
ValueError: y_true and y_pred contain different number of classes 3, 2. Please provide the true labels explicitly through the labels argument. Classes found in y_true: [0 1 2]

How can I fix this so that I can select features recursively?


Answer:

Log-loss needs probability predictions rather than class predictions, so you should add needs_proba=True:

log_loss_rfe = make_scorer(score_func=log_loss, needs_proba=True, greater_is_better=False)

The reason for the error is that without it, the y_pred being passed in is one-dimensional (the classes 0, 1, 2), and sklearn assumes this is a binary problem with those predictions being the probability of the positive class. To handle that it appends the complementary probability of the negative class, which leaves only two probability columns while your data has three classes.
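Putting it together, a minimal sketch of the corrected setup (reusing the estimator and data names from the question; the commented-out labels variant is an optional extra, not required for the fix):

# Scorer that hands the (n_samples, 3) predict_proba output to log_loss
from lightgbm import LGBMClassifier
from sklearn.feature_selection import RFECV
from sklearn.metrics import log_loss, make_scorer

log_loss_rfe = make_scorer(score_func=log_loss, needs_proba=True, greater_is_better=False)
# Optionally pin the label set as well, so a fold that happened to miss a class
# would still be scored against all three labels (extra kwargs are forwarded to log_loss):
# log_loss_rfe = make_scorer(score_func=log_loss, needs_proba=True,
#                            greater_is_better=False, labels=[0, 1, 2])

lgb_rfe = LGBMClassifier(objective='multiclass', learning_rate=0.01, verbose=0,
                         force_col_wise=True, random_state=100,
                         n_estimators=5_000, n_jobs=7)
rfe = RFECV(estimator=lgb_rfe, min_features_to_select=2, verbose=3, n_jobs=2,
            cv=3, scoring=log_loss_rfe)
rfe.fit(X=X_train, y=y_train)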
