
How do I decide the feature-selection threshold in SelectFromModel()?

IT Technology · xiaolong · April 15, 2025

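I am using a random forest classifier for feature selection. I have 70 features in total and want to select the most important ones. The code below prints the features ranked from most to least important according to the classifier.

Code: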

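import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Feature names: every column of `data` except the first
feat_labels = data.columns[1:]

clf = RandomForestClassifier(n_estimators=100, random_state=0)

# Train the classifier
clf.fit(X_train, y_train)

# Sort feature indices by importance, highest first
importances = clf.feature_importances_
indices = np.argsort(importances)[::-1]

for f in range(X_train.shape[1]):
    print("%2d) %-*s %f" % (f + 1, 30, feat_labels[indices[f]], importances[indices[f]]))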

Now I am trying to use SelectFromModel from sklearn.feature_selection, but how do I decide the threshold for my dataset?

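from sklearn.feature_selection import SelectFromModel

# Create a selector object that uses the random forest classifier to identify
# features with an importance of 0.15 or more
sfm = SelectFromModel(clf, threshold=0.15)

# Train the selector
sfm.fit(X_train, y_train)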

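When I use threshold=0.15 and then train my model, I get an error saying that the data is too noisy or the selection is too strict.

But with threshold=0.015 I am able to train my model on the newly selected features. So how should I decide on this threshold?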

Answer:

I would try the following approach:

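1. Start with a low threshold, for example 1e-4.
2. Fit and transform with SelectFromModel to reduce the feature set.
3. Compute a metric (accuracy, etc.) for your estimator (RandomForestClassifier in your case) trained on the selected features.
4. Increase the threshold and repeat steps 2 and 3 with the new value.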
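The steps above can be sketched roughly as follows. This is only an illustration, not the asker's actual setup: the data is a synthetic stand-in generated with make_classification (70 features, 10 of them informative), the helper name sweep_thresholds and the threshold grid are made up for the example, and a single train/validation split is used to score each threshold.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real dataset (assumption): 70 features, 10 informative.
X, y = make_classification(n_samples=300, n_features=70, n_informative=10,
                           n_redundant=0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)


def sweep_thresholds(X_train, y_train, X_val, y_val, thresholds):
    """Hypothetical helper: return (threshold, n_selected, accuracy) per threshold."""
    results = []
    for t in thresholds:
        # Step 2: fit the selector and see which features survive this threshold.
        selector = SelectFromModel(
            RandomForestClassifier(n_estimators=100, random_state=0), threshold=t)
        selector.fit(X_train, y_train)
        n_selected = int(selector.get_support().sum())
        if n_selected == 0:
            # The threshold is stricter than every importance: nothing to train on.
            results.append((t, 0, None))
            continue
        # Step 3: train a fresh estimator on the reduced features and score it.
        clf = RandomForestClassifier(n_estimators=100, random_state=0)
        clf.fit(selector.transform(X_train), y_train)
        acc = accuracy_score(y_val, clf.predict(selector.transform(X_val)))
        results.append((t, n_selected, acc))
    return results


# Steps 1 and 4: start low and increase the threshold (grid chosen for illustration).
thresholds = [1e-4, 0.005, 0.01, 0.02, 0.15]
results = sweep_thresholds(X_train, y_train, X_val, y_val, thresholds)
for t, n, acc in results:
    acc_text = "n/a" if acc is None else f"{acc:.3f}"
    print(f"threshold={t:<7g} features={n:<3d} accuracy={acc_text}")
```

A reasonable way to read the output is to pick the highest threshold whose accuracy is still close to the best one, since that keeps the fewest features. If the validation split is used for this choice, a separate held-out test set is probably worth keeping for the final evaluation.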

With this approach you can estimate the best threshold value for your particular data and estimator.
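When choosing which range of thresholds to try, it may help to remember that a random forest's feature_importances_ are normalized to sum to 1. With 70 features the average importance is therefore about 1/70 ≈ 0.014, which would be consistent with threshold=0.15 selecting nothing while 0.015 works. SelectFromModel also accepts string thresholds such as "mean", "median", or a scaled form like "1.25*mean", which can serve as data-driven starting points. The sketch below shows this on synthetic data (an assumption, not the asker's dataset):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in for the real dataset (assumption): 70 features, 10 informative.
X, y = make_classification(n_samples=300, n_features=70, n_informative=10,
                           n_redundant=0, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
importances = clf.feature_importances_

# The importances sum to 1, so their mean is 1 / n_features.
print("sum :", importances.sum())
print("mean:", importances.mean())
print("max :", importances.max())

# Try the string thresholds and record how many features each one keeps.
counts = {}
for t in ("median", "mean", "1.25*mean"):
    sfm = SelectFromModel(
        RandomForestClassifier(n_estimators=100, random_state=0), threshold=t)
    sfm.fit(X, y)
    counts[t] = int(sfm.get_support().sum())
    print(f"threshold={t!r}: {counts[t]} features kept (cutoff {sfm.threshold_:.4f})")
```

Comparing the printed maximum importance with a candidate numeric threshold is a quick check that the threshold can select anything at all before training on the reduced features.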

machine-learning numpy pandas python scikit-learn
