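I have a dataset with 100 columns of continuous features and a continuous label, and I want to run SVR: extract the relevant features, tune the hyperparameters, and then cross-validate the model that fits my data.
I wrote the following code: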
from sklearn.feature_selection import SelectKBest, mutual_info_regression
from sklearn.model_selection import GridSearchCV, RepeatedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import SVR

X_train, X_test, y_train, y_test = train_test_split(scaled_df, target, test_size=0.2)
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)

# define the pipeline to evaluate
model = SVR()
fs = SelectKBest(score_func=mutual_info_regression)
pipeline = Pipeline(steps=[('sel',fs), ('svr', model)])

# define the grid
grid = dict()
# number of features to try
grid['estimator__sel__k'] = [i for i in range(1, X_train.shape[1]+1)]

# define the grid search
#search = GridSearchCV(pipeline, grid, scoring='neg_mean_squared_error', n_jobs=-1, cv=cv)
search = GridSearchCV(
    pipeline,
    # estimator=SVR(kernel='rbf'),
    param_grid={
        'estimator__svr__C': [0.1, 1, 10, 100, 1000],
        'estimator__svr__epsilon': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 1, 5, 10],
        'estimator__svr__gamma': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 1, 5, 10]
    },
    scoring='neg_mean_squared_error',
    verbose=1,
    n_jobs=-1)

for param in search.get_params().keys():
    print(param)

# perform the search
results = search.fit(X_train, y_train)

# summarize the best result
print('Best MAE: %.3f' % results.best_score_)
print('Best Config: %s' % results.best_params_)

# summarize all results
means = results.cv_results_['mean_test_score']
params = results.cv_results_['params']
for mean, param in zip(means, params):
    print(">%.3f with: %r" % (mean, param))
I get the following error:
ValueError: Invalid parameter estimator for estimator Pipeline(memory=None, steps=[('sel', SelectKBest(k=10, score_func=<function mutual_info_regression at 0x7fd2ff649cb0>)), ('svr', SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='scale', kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False))], verbose=False). Check the list of available parameters with `estimator.get_params().keys()`.
When I print estimator.get_params().keys() as the error message suggests, I get:
cv
error_score
estimator__memory
estimator__steps
estimator__verbose
estimator__sel
estimator__svr
estimator__sel__k
estimator__sel__score_func
estimator__svr__C
estimator__svr__cache_size
estimator__svr__coef0
estimator__svr__degree
estimator__svr__epsilon
estimator__svr__gamma
estimator__svr__kernel
estimator__svr__max_iter
estimator__svr__shrinking
estimator__svr__tol
estimator__svr__verbose
estimator
iid
n_jobs
param_grid
pre_dispatch
refit
return_train_score
scoring
verbose
Fitting 5 folds for each of 405 candidates, totalling 2025 fits
But when I change this line:
pipeline = Pipeline(steps=[('sel',fs), ('svr', model)])
to:
pipeline = Pipeline(steps=[('estimator__sel',fs), ('estimator__svr', model)])
I get the following error:
ValueError: Estimator names must not contain __: got ['estimator__sel', 'estimator__svr']
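Could someone explain what I'm doing wrong, i.e. how do I combine the pipeline/feature selection step with GridSearchCV?
As a side note, if I comment out pipeline in GridSearchCV and uncomment estimator=SVR(kernel='rbf'), the cell runs without problems, but in that case I assume I'm not including feature selection, since it isn't called anywhere. I've seen some similar Stack Overflow questions before, e.g. here, but they don't seem to answer this specific question.
Is there a cleaner way to write this?
Answer: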
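The first error message is about the pipeline parameters, not the search parameters, and it indicates that your param_grid is at fault, not the pipeline step names. Running pipeline.get_params().keys() should show you the correct parameter names. Your grid should be: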
param_grid={
    'svr__C': [0.1, 1, 10, 100, 1000],
    'svr__epsilon': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 1, 5, 10],
    'svr__gamma': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 1, 5, 10]
},
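To make the naming difference concrete, here is a minimal sketch that compares the parameter names exposed by the pipeline with those exposed by the search object. It reuses the step names 'sel' and 'svr' from the question; the tiny grid is only a placeholder so the search object can be constructed, and nothing is fitted.

```python
from sklearn.feature_selection import SelectKBest, mutual_info_regression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVR

pipeline = Pipeline(steps=[
    ("sel", SelectKBest(score_func=mutual_info_regression)),
    ("svr", SVR()),
])
# Placeholder grid, only so that GridSearchCV can be constructed.
search = GridSearchCV(pipeline, param_grid={"svr__C": [1, 10]})

pipeline_params = set(pipeline.get_params().keys())
search_params = set(search.get_params().keys())

# param_grid keys are set on the pipeline, so they must be pipeline-level
# names of the form <step name>__<parameter>.
print("svr__C" in pipeline_params)             # True
print("sel__k" in pipeline_params)             # True

# The "estimator__" prefix only exists on the search object, because the
# pipeline is stored there as the `estimator` argument.
print("estimator__svr__C" in search_params)    # True
print("estimator__svr__C" in pipeline_params)  # False
```

This is why the names printed from search.get_params() look valid but are rejected when used as grid keys: they describe the search object, one level above the pipeline.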
I don't know how replacing the pipeline with a plain SVR managed to run; your parameter grid doesn't specify the right things there either…
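Putting it together, a sketch of how feature selection and SVR tuning might be searched in one grid could look like the following. The synthetic data from make_regression, the small grid values, and the 3-fold KFold are assumptions chosen to keep the example quick; they stand in for scaled_df, target, and the larger grids in the question.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, mutual_info_regression
from sklearn.model_selection import GridSearchCV, KFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import SVR

# Synthetic stand-in for the real data (assumption, not from the question).
X, y = make_regression(n_samples=120, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

pipeline = Pipeline(steps=[
    ("sel", SelectKBest(score_func=mutual_info_regression)),
    ("svr", SVR(kernel="rbf")),
])

# Keys follow <step name>__<parameter>, with no "estimator__" prefix.
param_grid = {
    "sel__k": [3, 5, 10],        # number of features to keep
    "svr__C": [1, 100],
    "svr__gamma": [0.01, 0.1],
}

cv = KFold(n_splits=3, shuffle=True, random_state=1)
search = GridSearchCV(pipeline, param_grid,
                      scoring="neg_mean_squared_error", cv=cv)
search.fit(X_train, y_train)

# The score is negated MSE, so flip the sign when reporting it.
print("Best CV MSE: %.3f" % -search.best_score_)
print("Best config: %s" % search.best_params_)
print("Test MSE: %.3f" % -search.score(X_test, y_test))
```

Because the selector is a pipeline step, it is refitted inside every cross-validation fold, so the choice of k is tuned together with the SVR hyperparameters. Note that with scoring='neg_mean_squared_error' the reported value is an MSE, not an MAE.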