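How do I use a fixed validation set (instead of K-fold cross-validation) in Scikit-learn to train a decision tree classifier / random forest classifier?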

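I am new to machine learning and data science, so please forgive me if this is a silly question.

I can see that there are built-in functions for cross-validation, but none for a fixed validation set. I have a dataset of 50,000 samples labeled with years from 1990 to 2010. I need to train different classifiers on the 1990 to 2008 samples, validate them on the 2009 samples, and finally test them on the 2010 samples.

Edit: Following @Quan Tran's answer, I tried this. Is this what I should be doing?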

    # Fit a decision tree
    estimator1 = DecisionTreeClassifier( max_depth = 9, max_leaf_nodes=9)
    estimator1.fit(X_train, y_train)
    print estimator1

    # validate using validation set
    acc = np.zeros((20,20))  # store accuracy
    for i in range(20):
        for j in range(20):
            estimator1 = DecisionTreeClassifier(max_depth = i+1, max_leaf_nodes=j+2)
            estimator1.fit(X_valid, y_valid)
            y_pred = estimator1.predict(X_valid)
            acc[i,j] = accuracy_score(y_valid, y_pred)

    best_mod = np.where(acc == acc.max())
    print best_mod
    print acc[best_mod]

    # Predict target values
    estimator1 = DecisionTreeClassifier(max_depth = int(best_mod[0]) + 1, max_leaf_nodes= int(best_mod[1]) + 2)
    estimator1.fit(X_valid, y_valid)
    y_pred = estimator1.predict(X_test)
    confusion = metrics.confusion_matrix(y_test, y_pred)
    TP = confusion[1, 1]
    TN = confusion[0, 0]
    FP = confusion[0, 1]
    FN = confusion[1, 0]

    # Classification Accuracy
    print "======= ACCURACY ========"
    print((TP + TN) / float(TP + TN + FP + FN))
    print accuracy_score(y_valid, y_pred)

    # store the predicted probabilities for class
    y_pred_prob = estimator1.predict_proba(X_test)[:, 1]

    # plot a ROC curve for y_test and y_pred_prob
    fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred_prob)
    plt.plot(fpr, tpr)
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.0])
    plt.title('ROC curve for DecisionTreeClassifier')
    plt.xlabel('False Positive Rate (1 - Specificity)')
    plt.ylabel('True Positive Rate (Sensitivity)')
    plt.grid(True)
    print("======= AUC ========")
    print(metrics.roc_auc_score(y_test, y_pred_prob))

This is the output I got; the accuracy is not very good.

    DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=9,
            max_features=None, max_leaf_nodes=9, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')
    (array([5]), array([19]))
    [ 0.8489011]
    ======= ACCURACY ========
    0.574175824176
    0.538461538462
    ======= AUC ========
    0.547632099893

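Answer:

In this situation there are three separate datasets: a training set, a validation set, and a test set.

The training set is used to fit the parameters of the classifier. For example: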

    clf = DecisionTreeClassifier(max_depth=2)
    clf.fit(trainfeatures, labels)

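The validation set is used to tune the classifier's hyperparameters or to find a stopping point for training. For a decision tree, for example, max_depth is a hyperparameter. You find a good set of hyperparameters by trying different values (tuning) and comparing a performance metric (accuracy, precision, etc.) on the validation set.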
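As a rough illustration of that tuning loop, here is a minimal sketch. It assumes synthetic data from make_classification and a made-up `years` array standing in for the real dataset, and the hyperparameter ranges are arbitrary. The key point is that each candidate is fitted on the training years only and is merely scored on the validation year.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real data (assumption): 2,000 samples plus a
# made-up "years" array that is used only to define the split.
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
years = np.random.RandomState(0).randint(1990, 2011, size=len(y))  # 1990..2010

train_mask = years <= 2008   # training years
valid_mask = years == 2009   # fixed validation year
X_train, y_train = X[train_mask], y[train_mask]
X_valid, y_valid = X[valid_mask], y[valid_mask]

best_acc, best_params = -1.0, None
for max_depth in range(1, 11):
    for max_leaf_nodes in range(2, 22, 4):
        clf = DecisionTreeClassifier(
            max_depth=max_depth,
            max_leaf_nodes=max_leaf_nodes,
            random_state=0,
        )
        clf.fit(X_train, y_train)  # fit on the training set only
        acc = accuracy_score(y_valid, clf.predict(X_valid))  # score on validation
        if acc > best_acc:
            best_acc = acc
            best_params = {"max_depth": max_depth, "max_leaf_nodes": max_leaf_nodes}

print(best_params, best_acc)
```

Fitting on the validation set itself and then scoring on it would report how well the tree memorizes those samples, not how well it generalizes, so the selected hyperparameters would tend toward the largest trees.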

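The test set is used to estimate the error rate on unseen data. Once you have measured performance on the test set, the model must not be trained or tuned any further.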
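If you would rather reuse Scikit-learn's search utilities than write the loop by hand, one option is PredefinedSplit from sklearn.model_selection, which lets GridSearchCV treat a fixed validation set as a single "fold". The sketch below is only an illustration: the data is synthetic, the `years` array is made up, and the parameter grid is arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, PredefinedSplit

# Synthetic stand-in for the real data (assumption), with a made-up
# "years" array used only to define the three subsets.
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
years = np.random.RandomState(0).randint(1990, 2011, size=len(y))  # 1990..2010

# Hold out the test year completely; it is touched once, at the very end.
test_mask = years == 2010
X_dev, y_dev, years_dev = X[~test_mask], y[~test_mask], years[~test_mask]
X_test, y_test = X[test_mask], y[test_mask]

# test_fold: -1 means "always in the training part",
# 0 means "member of validation fold 0" (here, the 2009 samples).
test_fold = np.where(years_dev == 2009, 0, -1)
cv = PredefinedSplit(test_fold)

search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    param_grid={"max_depth": [3, 6, 9]},
    scoring="accuracy",
    cv=cv,
)
search.fit(X_dev, y_dev)

# Single, final evaluation on the untouched test year.
test_acc = search.score(X_test, y_test)
print(search.best_params_, test_acc)
```

Note that with the default refit=True, GridSearchCV retrains the best configuration on everything passed to fit, which here means the training and validation years together. Pass refit=False and refit manually if the final model should see the 1990 to 2008 samples only.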
