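Using RandomizedSearchCV (or GridSearchCV) with LeaveOneGroupOut cross-validation in scikit-learn

I like using scikit-learn's LOGO (leave-one-group-out) as a cross-validation method, in combination with learning curves. This works well in most of the cases I deal with, but I can only (efficiently) vary the two parameters that, in my experience, are the most critical in those cases: max features and number of estimators. An example of my code is below: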

    import matplotlib.pyplot as plt
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import f1_score, make_scorer
    from sklearn.model_selection import LeaveOneGroupOut, validation_curve

    # X, y and training_data are assumed to be loaded already.
    Fscorer = make_scorer(f1_score, average='micro')
    gp = training_data["GP"].values  # group label of every sample
    logo = LeaveOneGroupOut()

    # One classifier per n_estimators value to compare.
    RF_clf100 = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=49)
    RF_clf200 = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=49)
    RF_clf300 = RandomForestClassifier(n_estimators=300, n_jobs=-1, random_state=49)
    RF_clf400 = RandomForestClassifier(n_estimators=400, n_jobs=-1, random_state=49)
    RF_clf500 = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=49)
    RF_clf600 = RandomForestClassifier(n_estimators=600, n_jobs=-1, random_state=49)

    param_name = "max_features"
    param_range = [5, 10, 15, 20, 25, 30]

    # Validation curve over max_features for n_estimators = 100.
    plt.figure()
    plt.suptitle('n_estimators = 100', fontsize=14, fontweight='bold')
    _, test_scores = validation_curve(RF_clf100, X, y, cv=logo.split(X, y, groups=gp),
                                      param_name=param_name, param_range=param_range,
                                      scoring=Fscorer, n_jobs=-1)
    test_scores_mean = np.mean(test_scores, axis=1)  # average over held-out groups
    plt.plot(param_range, test_scores_mean, label="mean test F1")
    plt.xlabel(param_name)
    plt.xlim(min(param_range), max(param_range))
    plt.ylabel("F1")
    plt.ylim(0.47, 0.57)
    plt.legend(loc="best")
    plt.show()

    # Same plot for n_estimators = 200.
    plt.figure()
    plt.suptitle('n_estimators = 200', fontsize=14, fontweight='bold')
    _, test_scores = validation_curve(RF_clf200, X, y, cv=logo.split(X, y, groups=gp),
                                      param_name=param_name, param_range=param_range,
                                      scoring=Fscorer, n_jobs=-1)
    test_scores_mean = np.mean(test_scores, axis=1)
    plt.plot(param_range, test_scores_mean, label="mean test F1")
    plt.xlabel(param_name)
    plt.xlim(min(param_range), max(param_range))
    plt.ylabel("F1")
    plt.ylim(0.47, 0.57)
    plt.legend(loc="best")
    plt.show()
    ...
    ...

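But what I really want is to combine LOGO with either grid search or randomized search, so that I can explore the parameter space more thoroughly.

At the moment my code looks like this: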

    from scipy.stats import randint as sp_randint
    from sklearn.model_selection import RandomizedSearchCV

    # Distributions and lists to sample the candidate parameters from.
    param_dist = {"n_estimators": [100, 200, 300, 400, 500, 600],
                  "max_features": sp_randint(5, 30),
                  "max_depth": sp_randint(2, 18),
                  "criterion": ['entropy', 'gini'],
                  "min_samples_leaf": sp_randint(2, 17)}

    clf = RandomForestClassifier(random_state=49)

    n_iter_search = 45
    random_search = RandomizedSearchCV(clf, param_distributions=param_dist,
                                       n_iter=n_iter_search,
                                       scoring=Fscorer, cv=8,
                                       n_jobs=-1)
    random_search.fit(X, y)

When I replace cv=8 with cv=logo.split(X, y, groups=gp), I get the following error message:

    ---------------------------------------------------------------------------
    TypeError                                 Traceback (most recent call last)
    <ipython-input-10-0092e11ffbf4> in <module>()
    ---> 35 random_search.fit(X, y)

    /Applications/anaconda/lib/python2.7/site-packages/sklearn/model_selection/_search.pyc in fit(self, X, y, groups)
       1183                                           self.n_iter,
       1184                                           random_state=self.random_state)
    -> 1185         return self._fit(X, y, groups, sampled_params)

    /Applications/anaconda/lib/python2.7/site-packages/sklearn/model_selection/_search.pyc in _fit(self, X, y, groups, parameter_iterable)
        540
        541         X, y, groups = indexable(X, y, groups)
    --> 542         n_splits = cv.get_n_splits(X, y, groups)
        543         if self.verbose > 0 and isinstance(parameter_iterable, Sized):
        544             n_candidates = len(parameter_iterable)

    /Applications/anaconda/lib/python2.7/site-packages/sklearn/model_selection/_split.pyc in get_n_splits(self, X, y, groups)
       1489             Returns the number of splitting iterations in the cross-validator.
       1490         """
    -> 1491         return len(self.cv)  # Both iterables and old-cv objects support len
       1492
       1493     def split(self, X=None, y=None, groups=None):

    TypeError: object of type 'generator' has no len()

Any suggestions about (1) what is going on and, more importantly, (2) how I can make it work (combining RandomizedSearchCV with LeaveOneGroupOut)?

* Update, 8 February 2017 *

It worked using cv=logo together with @Vivek Kumar's suggestion of random_search.fit(X, y, wells).

Answer:

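You should not pass logo.split() to RandomizedSearchCV; pass only the cross-validator object itself, such as logo. As the traceback shows, this version of scikit-learn calls len() on the iterable given as cv, and the generator returned by split() has no length. RandomizedSearchCV calls split() internally to generate the train/test indices. Pass your gp groups in the call to fit() on the RandomizedSearchCV or GridSearchCV object.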
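The failure reported in the traceback can be reproduced without any search object, since it comes down to calling len() on a generator. The sketch below uses a small made-up dataset (the arrays X, y and groups are hypothetical and not taken from the question) to show that split() returns a generator, whereas the cross-validator object itself can report its number of splits once it is given the groups. Newer scikit-learn releases may treat a generator passed as cv differently, so this is only an illustration of the Python behavior behind the error message.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

# Hypothetical toy data: 6 samples belonging to 3 groups.
X = np.arange(12).reshape(6, 2)
y = np.array([0, 1, 0, 1, 0, 1])
groups = np.array([1, 1, 2, 2, 3, 3])

logo = LeaveOneGroupOut()
splits = logo.split(X, y, groups=groups)  # a generator of (train, test) index pairs

# Generators do not support len(), which is the TypeError in the traceback.
try:
    len(splits)
    generator_has_len = True
except TypeError:
    generator_has_len = False

# The cross-validator object can count its splits when given the groups.
n_splits = logo.get_n_splits(X, y, groups)

# Materializing the generator gives one (train, test) pair per group.
materialized = list(splits)

print(generator_has_len, n_splits, len(materialized))
```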

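Instead of doing this:

    random_search.fit(X, y)

Do this (passing the groups by keyword keeps the call independent of the positional order of fit()'s arguments):

    random_search.fit(X, y, groups=gp)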
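Put together, the approach might look like the following minimal sketch. It assumes synthetic data from make_classification and an artificial group array in place of the asker's X, y and gp, and it uses a deliberately tiny search space and n_iter so that it runs quickly; none of these values come from the question.

```python
import numpy as np
from scipy.stats import randint as sp_randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import LeaveOneGroupOut, RandomizedSearchCV

# Synthetic stand-ins for the asker's data (assumption): 120 samples, 4 groups.
X, y = make_classification(n_samples=120, n_features=10, n_informative=5,
                           random_state=0)
gp = np.repeat(np.arange(4), 30)

Fscorer = make_scorer(f1_score, average="micro")
logo = LeaveOneGroupOut()

# Small illustrative search space, much narrower than the one in the question.
param_dist = {
    "n_estimators": [10, 20],
    "max_features": sp_randint(2, 8),
    "max_depth": sp_randint(2, 6),
}

random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=49),
    param_distributions=param_dist,
    n_iter=3,
    scoring=Fscorer,
    cv=logo,          # the cross-validator object, not logo.split(...)
    random_state=0,
)

# The groups are supplied at fit time; the search calls logo.split() itself.
random_search.fit(X, y, groups=gp)

print(random_search.best_params_)
print(random_search.n_splits_)  # one split per group
```

Each candidate is then scored once per held-out group, so the total number of model fits is n_iter times the number of distinct groups (plus the final refit).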

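Edit: You can also pass gp to the constructor of GridSearchCV or RandomizedSearchCV through the fit_params parameter, as a dict. Note that this applies to the older scikit-learn release used here; the constructor's fit_params argument was deprecated and removed in later releases, so passing the groups to fit() is the more durable option.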
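Since the answer applies to GridSearchCV as well, here is a comparable sketch for an exhaustive grid, again passing the groups to fit(). The data, the group array and the grid values are illustrative assumptions, and the built-in "f1_micro" scoring string is used in place of the question's Fscorer for brevity.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, LeaveOneGroupOut

# Synthetic stand-ins for the asker's data (assumption): 90 samples, 3 groups.
X, y = make_classification(n_samples=90, n_features=8, n_informative=4,
                           random_state=1)
gp = np.repeat(np.arange(3), 30)

# Illustrative grid: 2 x 2 = 4 candidate settings.
param_grid = {"n_estimators": [10, 20], "max_features": [2, 4]}

grid_search = GridSearchCV(
    RandomForestClassifier(random_state=49),
    param_grid=param_grid,
    scoring="f1_micro",
    cv=LeaveOneGroupOut(),
)
grid_search.fit(X, y, groups=gp)

# cv_results_ holds one test-score column per held-out group.
per_group_keys = sorted(
    k for k in grid_search.cv_results_
    if k.startswith("split") and k.endswith("_test_score")
)
print(per_group_keys)
print(grid_search.best_params_)
```

Inspecting the per-split columns of cv_results_ can be useful with LOGO, because it shows which held-out group a given parameter setting generalizes to poorly.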
