I wrapped Scikit-Learn's random forest model in a class, as follows:
    from sklearn.base import BaseEstimator, RegressorMixin

    class Model(BaseEstimator, RegressorMixin):
        def __init__(self, model):
            self.model = model

        def fit(self, X, y):
            self.model.fit(X, y)
            return self

        def score(self, X, y):
            from sklearn.metrics import mean_squared_error
            return mean_squared_error(y_true=y, y_pred=self.model.predict(X), squared=False)

        def predict(self, X):
            return self.model.predict(X)

    class RandomForest(Model):
        def __init__(self, n_estimators=100, max_depth=None, min_samples_split=2,
                     min_samples_leaf=1, max_features=None):
            self.n_estimators = n_estimators
            self.max_depth = max_depth
            self.min_samples_split = min_samples_split
            self.min_samples_leaf = min_samples_leaf
            self.max_features = max_features

            from sklearn.ensemble import RandomForestRegressor
            self.model = RandomForestRegressor(n_estimators=self.n_estimators,
                                               max_depth=self.max_depth,
                                               min_samples_split=self.min_samples_split,
                                               min_samples_leaf=self.min_samples_leaf,
                                               max_features=self.max_features,
                                               random_state=777)

        def get_params(self, deep=True):
            return {"n_estimators": self.n_estimators,
                    "max_depth": self.max_depth,
                    "min_samples_split": self.min_samples_split,
                    "min_samples_leaf": self.min_samples_leaf,
                    "max_features": self.max_features}

        def set_params(self, **parameters):
            for parameter, value in parameters.items():
                setattr(self, parameter, value)
            return self
I mostly followed Scikit-Learn's official guide, which can be found at https://scikit-learn.org/stable/developers/develop.html
My grid search looks like this:
    import pandas as pd
    from sklearn.model_selection import GridSearchCV

    grid_search = GridSearchCV(estimator=RandomForest(),
                               param_grid={'max_depth': [1, 3, 6],
                                           'n_estimators': [10, 100, 300]},
                               n_jobs=-1,
                               scoring='neg_root_mean_squared_error',
                               cv=5,
                               verbose=True).fit(X, y)
    print(pd.DataFrame(grid_search.cv_results_).sort_values(by='rank_test_score'))
The grid search output and grid_search.cv_results_ are shown below:
    Fitting 5 folds for each of 9 candidates, totalling 45 fits
    [Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
       mean_fit_time  std_fit_time  mean_score_time  std_score_time  \
    0       0.210918      0.002450         0.016754        0.000223
    1       0.207049      0.001675         0.016579        0.000147
    2       0.206495      0.002001         0.016598        0.000158
    3       0.206799      0.002417         0.016740        0.000144
    4       0.207534      0.001603         0.016668        0.000269
    5       0.206384      0.001396         0.016605        0.000136
    6       0.220052      0.024280         0.017247        0.001137
    7       0.226838      0.027507         0.017351        0.000979
    8       0.205738      0.003420         0.016246        0.000626

      param_max_depth param_n_estimators                                 params  \
    0               1                 10   {'max_depth': 1, 'n_estimators': 10}
    1               1                100  {'max_depth': 1, 'n_estimators': 100}
    2               1                300  {'max_depth': 1, 'n_estimators': 300}
    3               3                 10   {'max_depth': 3, 'n_estimators': 10}
    4               3                100  {'max_depth': 3, 'n_estimators': 100}
    5               3                300  {'max_depth': 3, 'n_estimators': 300}
    6               6                 10   {'max_depth': 6, 'n_estimators': 10}
    7               6                100  {'max_depth': 6, 'n_estimators': 100}
    8               6                300  {'max_depth': 6, 'n_estimators': 300}

       split0_test_score  split1_test_score  split2_test_score  split3_test_score  \
    0          -5.246725          -3.200585          -3.326962          -3.209387
    1          -5.246725          -3.200585          -3.326962          -3.209387
    2          -5.246725          -3.200585          -3.326962          -3.209387
    3          -5.246725          -3.200585          -3.326962          -3.209387
    4          -5.246725          -3.200585          -3.326962          -3.209387
    5          -5.246725          -3.200585          -3.326962          -3.209387
    6          -5.246725          -3.200585          -3.326962          -3.209387
    7          -5.246725          -3.200585          -3.326962          -3.209387
    8          -5.246725          -3.200585          -3.326962          -3.209387

       split4_test_score  mean_test_score  std_test_score  rank_test_score
    0          -2.911422        -3.579016        0.845021                1
    1          -2.911422        -3.579016        0.845021                1
    2          -2.911422        -3.579016        0.845021                1
    3          -2.911422        -3.579016        0.845021                1
    4          -2.911422        -3.579016        0.845021                1
    5          -2.911422        -3.579016        0.845021                1
    6          -2.911422        -3.579016        0.845021                1
    7          -2.911422        -3.579016        0.845021                1
    8          -2.911422        -3.579016        0.845021                1
    [Parallel(n_jobs=-1)]: Done 45 out of 45 | elapsed: 3.2s finished
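My question is: why does the grid search return exactly the same scores for every parameter combination on each data split?
My hypothesis is that the grid search is evaluating only one parameter combination (e.g. {'max_depth': 1, 'n_estimators': 10}) on all data splits. If that is the case, why does it happen?
Finally, how can I get the grid search to return correct results for all data splits?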
Answer:
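Your set_params method does not actually change the hyperparameters of the RandomForestRegressor instance held in the self.model attribute. Instead, it sets attributes directly on your RandomForest instance (attributes that did not exist before and that have no effect on the actual model!). As a result, the grid search keeps setting these new, inconsequential parameters, while the model that actually gets fitted is the same every time. (Likewise, the get_params method reads the attributes of RandomForest, which are not the same as those of RandomForestRegressor.)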
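The effect can be reproduced without running a grid search at all. The sketch below uses a stripped-down, hypothetical stand-in for the question's class (BrokenWrapper, with only two of the five hyperparameters) and assumes the same setattr-based set_params. After the call, the wrapper reports the new values while the inner regressor still holds its constructor defaults.

```python
from sklearn.ensemble import RandomForestRegressor


class BrokenWrapper:
    """Hypothetical, reduced stand-in for the RandomForest class in the question."""

    def __init__(self, n_estimators=100, max_depth=None):
        self.n_estimators = n_estimators
        self.max_depth = max_depth
        # The inner regressor copies the values once, at construction time.
        self.model = RandomForestRegressor(
            n_estimators=self.n_estimators,
            max_depth=self.max_depth,
            random_state=777,
        )

    def set_params(self, **parameters):
        # Same logic as in the question: only the wrapper's attributes change.
        for parameter, value in parameters.items():
            setattr(self, parameter, value)
        return self


wrapper = BrokenWrapper()
wrapper.set_params(max_depth=6, n_estimators=300)

wrapper_view = (wrapper.max_depth, wrapper.n_estimators)
inner_view = (wrapper.model.max_depth, wrapper.model.n_estimators)
print("wrapper:    ", wrapper_view)  # (6, 300)
print("inner model:", inner_view)    # (None, 100)
```

Since fit and predict only ever touch self.model, every candidate in the grid ends up training the same default forest, which matches the identical scores in the output above.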
You can fix most of this by having set_params simply call self.model.set_params (and having get_params read self.model.<parameter_name> rather than just self.<parameter_name>).
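A minimal sketch of that delegation is given here, under a few assumptions that are not in the original post: the class name DelegatingRandomForest and the PARAM_NAMES tuple are hypothetical, the data is synthetic (make_regression), and the grid is smaller than the one in the question so that it runs quickly.

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Hyperparameters the wrapper exposes (the same five as in the question).
PARAM_NAMES = (
    "n_estimators", "max_depth", "min_samples_split",
    "min_samples_leaf", "max_features",
)


class DelegatingRandomForest(RegressorMixin, BaseEstimator):
    """Hypothetical wrapper whose get_params/set_params talk to the inner model."""

    def __init__(self, n_estimators=100, max_depth=None, min_samples_split=2,
                 min_samples_leaf=1, max_features=None):
        self.model = RandomForestRegressor(
            n_estimators=n_estimators,
            max_depth=max_depth,
            min_samples_split=min_samples_split,
            min_samples_leaf=min_samples_leaf,
            max_features=max_features,
            random_state=777,
        )

    def get_params(self, deep=True):
        # Read the values from the regressor that will actually be fitted.
        return {name: getattr(self.model, name) for name in PARAM_NAMES}

    def set_params(self, **parameters):
        # Forward the new values to the inner regressor.
        self.model.set_params(**parameters)
        return self

    def fit(self, X, y):
        self.model.fit(X, y)
        return self

    def predict(self, X):
        return self.model.predict(X)


# Synthetic data, used only so the sketch runs on its own.
X, y = make_regression(n_samples=80, n_features=5, noise=10.0, random_state=0)

grid_search = GridSearchCV(
    estimator=DelegatingRandomForest(),
    param_grid={"max_depth": [1, 6], "n_estimators": [5, 20]},
    scoring="neg_root_mean_squared_error",
    cv=3,
).fit(X, y)

mean_scores = grid_search.cv_results_["mean_test_score"]
print(np.round(mean_scores, 3))
```

With the parameters forwarded to the inner regressor, the candidates should no longer share a single score. The exact values depend on the data and the installed scikit-learn version.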
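I think there is another problem, and because of it I do not see how your example runs at all: you instantiate the model attribute using self.<parameter_name>, but that is never defined in __init__.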
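One way to sidestep both issues, offered as an alternative sketch rather than as what the original code intended, is to have __init__ do nothing but store its arguments and to build the inner regressor inside fit. The inherited BaseEstimator.get_params and set_params then work unchanged. The class name LazyRandomForest and the model_ attribute are hypothetical, and the four-point dataset is only there to make the block runnable.

```python
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.ensemble import RandomForestRegressor


class LazyRandomForest(RegressorMixin, BaseEstimator):
    """Hypothetical wrapper that builds its inner regressor at fit time."""

    def __init__(self, n_estimators=100, max_depth=None, min_samples_split=2,
                 min_samples_leaf=1, max_features=None):
        # Only store the arguments; no model is created here.
        self.n_estimators = n_estimators
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.min_samples_leaf = min_samples_leaf
        self.max_features = max_features

    def fit(self, X, y):
        # Build the regressor from the current attribute values, so anything
        # changed through set_params before fitting is picked up.
        self.model_ = RandomForestRegressor(
            n_estimators=self.n_estimators,
            max_depth=self.max_depth,
            min_samples_split=self.min_samples_split,
            min_samples_leaf=self.min_samples_leaf,
            max_features=self.max_features,
            random_state=777,
        )
        self.model_.fit(X, y)
        return self

    def predict(self, X):
        return self.model_.predict(X)


# Tiny toy dataset, used only to show that set_params reaches the fitted model.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0.0, 1.0, 2.0, 3.0]

est = LazyRandomForest(n_estimators=10).set_params(max_depth=3)
est.fit(X, y)
print(est.model_.max_depth)  # 3
```

Because the regressor is created from the attributes at fit time, no custom get_params or set_params is needed in this variant.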