Machine learning: random forest

I am trying to fit a random forest classifier to an imbalanced dataset using the scikit-learn Python library.

My goal is to get roughly the same value for recall and precision. To do that, I am using the class_weight parameter of RandomForestClassifier.

When I fit the random forest with class_weight = {0:1, 1:1} (in other words, treating the dataset as if it were not imbalanced), I get:

Accuracy: 0.79, Precision: 0.63, Recall: 0.32, AUC: 0.74

When I change class_weight to {0:1, 1:10}, I get:

Accuracy: 0.79, Precision: 0.65, Recall: 0.29, AUC: 0.74

So recall and precision barely change (and they still change very little even if I raise the class 1 weight from 10 to 100).
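The comparison described above can be sketched roughly as follows. This is only an illustration under assumptions: the original data is not shown, so a synthetic dataset from make_classification stands in for it, and the sizes, the 90/10 class split, n_estimators=50 and the helper name evaluate are all made up for the example. The numbers it prints will not match the ones quoted in the question.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: roughly 90% class 0, 10% class 1.
X, y = make_classification(n_samples=3000, n_features=20,
                           weights=[0.9, 0.1], random_state=0)
# stratify=y keeps the class proportions the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)


def evaluate(class_weight):
    """Fit a forest with the given class weights and score it on the test set."""
    clf = RandomForestClassifier(n_estimators=50, class_weight=class_weight,
                                 random_state=0)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)              # hard labels (0.5 cut-off)
    y_prob = clf.predict_proba(X_test)[:, 1]  # class-1 probabilities for AUC
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred, zero_division=0),
        "recall": recall_score(y_test, y_pred),
        "auc": roc_auc_score(y_test, y_prob),
    }


results = {name: evaluate(cw) for name, cw in
           [("1:1", {0: 1, 1: 1}), ("1:10", {0: 1, 1: 10})]}
for name, metrics in results.items():
    print(name, {k: round(v, 2) for k, v in metrics.items()})
```

Printing both rows side by side makes it easy to see how much, or how little, the weights move each metric on a given dataset.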

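Since X_train and X_test have the same class proportions (the dataset has more than 1 million rows), shouldn't I get very different recall and precision values when I use class_weight = {0:1, 1:10}?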

Answer:

If you want to increase your model's recall, there is a quicker way to do it.

You can compute the precision-recall curve with sklearn.

This curve shows the trade-off between precision and recall for your model.

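This means that if you want to increase recall, you can ask the random forest for the probability of each class (predict_proba), then add 0.1 to the class 1 probability and subtract 0.1 from the class 0 probability. For a binary problem this is the same as lowering the decision threshold for class 1 from 0.5 to 0.4, so more samples are labelled as class 1. That raises recall, usually at some cost in precision.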
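To make the arithmetic concrete, here is a small sketch in plain Python. The probabilities are invented for illustration (they stand in for what predict_proba(X)[:, 1] might return), and the helper names predict_shifted and predict_threshold are hypothetical. It shows that shifting both probabilities by 0.1 gives the same labels as a 0.4 threshold on the class 1 probability.

```python
# Hypothetical class-1 probabilities, standing in for predict_proba(X)[:, 1].
p1_values = [0.05, 0.35, 0.42, 0.48, 0.55, 0.90]


def predict_shifted(p1, shift=0.1):
    """Add `shift` to class 1, subtract it from class 0, pick the larger."""
    p0 = 1.0 - p1
    return int(p1 + shift > p0 - shift)


def predict_threshold(p1, threshold=0.4):
    """Predict class 1 when its probability exceeds the threshold."""
    return int(p1 > threshold)


shifted = [predict_shifted(p) for p in p1_values]
lowered = [predict_threshold(p) for p in p1_values]
default = [predict_threshold(p, threshold=0.5) for p in p1_values]

print(shifted)  # [0, 0, 1, 1, 1, 1]
print(lowered)  # [0, 0, 1, 1, 1, 1]
print(default)  # [0, 0, 0, 0, 1, 1]
```

The two samples at 0.42 and 0.48 flip from class 0 to class 1, which is how the shift can turn false negatives into true positives (or true negatives into false positives).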

If you plot the precision-recall curve, you will be able to find the threshold at which precision and recall are equal.
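One possible way to locate that threshold numerically, instead of reading it off the plot, is sketched below. It is an assumption-laden illustration: the data is synthetic (make_classification), the forest settings are arbitrary, and the variable names gap, best and best_threshold are made up here. It picks the threshold where the absolute difference between precision and recall is smallest, since on finite data the two may never be exactly equal.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (precision_recall_curve, precision_score,
                             recall_score)
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the real dataset.
X, y = make_classification(n_samples=3000, n_features=20,
                           weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)
y_prob = clf.predict_proba(X_test)[:, 1]  # class-1 probabilities

precision, recall, thresholds = precision_recall_curve(y_test, y_prob)
# precision and recall have one more entry than thresholds;
# the last point (precision=1, recall=0) has no threshold, so drop it.
gap = np.abs(precision[:-1] - recall[:-1])
best = int(np.argmin(gap))
best_threshold = thresholds[best]

# precision_recall_curve counts scores >= threshold as positive.
y_pred = (y_prob >= best_threshold).astype(int)
print("threshold:", best_threshold)
print("precision:", precision_score(y_test, y_pred))
print("recall:", recall_score(y_test, y_pred))
```

In practice the threshold would normally be chosen on a separate validation split rather than on the test set, so that the reported test metrics stay unbiased.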

Here is an example based on the sklearn documentation:

from sklearn import svm, datasets
from sklearn.model_selection import train_test_split
import numpy as np

iris = datasets.load_iris()
X = iris.data
y = iris.target

# Add noisy features
random_state = np.random.RandomState(0)
n_samples, n_features = X.shape
X = np.c_[X, random_state.randn(n_samples, 200 * n_features)]

# Limit to the two first classes, and split into training and test
X_train, X_test, y_train, y_test = train_test_split(X[y < 2], y[y < 2],
                                                    test_size=.5,
                                                    random_state=random_state)

# Create a simple classifier
classifier = svm.LinearSVC(random_state=random_state)
classifier.fit(X_train, y_train)
y_score = classifier.decision_function(X_test)

from sklearn.metrics import precision_recall_curve
import matplotlib.pyplot as plt
# sklearn.utils.fixes.signature is gone from recent scikit-learn releases;
# the standard-library version behaves the same here.
from inspect import signature

precision, recall, _ = precision_recall_curve(y_test, y_score)

# In matplotlib < 1.5, plt.fill_between does not have a 'step' argument
step_kwargs = ({'step': 'post'}
               if 'step' in signature(plt.fill_between).parameters
               else {})
plt.step(recall, precision, color='b', alpha=0.2,
         where='post')
plt.fill_between(recall, precision, alpha=0.2, color='b', **step_kwargs)

plt.xlabel('Recall')
plt.ylabel('Precision')
plt.ylim([0.0, 1.05])
plt.xlim([0.0, 1.0])
plt.show()

This should give you a stepwise precision-recall plot, with recall on the x-axis and precision on the y-axis.
