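If I find the optimal parameters using GridSearchCV with a pipeline, is there a way to save the trained model so that I can later apply the whole pipeline to new data and generate predictions for it? For example, I have the following pipeline and parameter grid: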
from pprint import pprint
from time import time

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(SVC(probability=True))),
])

parameters = {
    'vect__ngram_range': ((1, 1), (1, 2), (1, 3)),  # unigrams, up to bigrams, or up to trigrams
    'clf__estimator__kernel': ('rbf', 'linear'),
    'clf__estimator__C': tuple([10**i for i in range(-10, 11)]),
}

grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1)

print("Performing grid search...")
print("pipeline:", [name for name, _ in pipeline.steps])
print("parameters:")
pprint(parameters)
t0 = time()

# Conduct the grid search (X holds the raw documents, y the labels)
grid_search.fit(X, y)
print("done in %0.3fs" % (time() - t0))
print()

print("Best score: %0.3f" % grid_search.best_score_)
print("Best parameters set:")

# Obtain the top performing parameters
best_parameters = grid_search.best_estimator_.get_params()

# Print the results
for param_name in sorted(parameters.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))
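Now I want to save all of these steps as a single workflow that I can apply to a new, unseen dataset, so that it uses the same parameters, vectorizer, and transformer to transform the data, run the model, and report the results. How can I do that?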
Answer:
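You can pickle the fitted GridSearchCV object directly, then unpickle it whenever you want to use it to predict on new data. Pickle is a binary format, so the file has to be opened in binary mode ("wb" to write, "rb" to read).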
import pickle

# Fit model and pickle fitted model
grid_search.fit(X, y)
with open('/model/path/model_pickle_file', "wb") as fp:
    pickle.dump(grid_search, fp)

# Load model from file
with open('/model/path/model_pickle_file', "rb") as fp:
    grid_search_load = pickle.load(fp)

# Predict new data with model loaded from disk
y_new = grid_search_load.best_estimator_.predict(X_new)
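If you only need predictions later, one option is to pickle just `best_estimator_` instead of the whole search object. With the default `refit=True`, that attribute is the complete pipeline (vectorizer, TF-IDF transformer, and classifier) refit on all the training data with the best parameters. Below is a minimal, self-contained sketch of that round trip. The toy corpus, the small parameter grid, the plain `SVC` (without the `OneVsRestClassifier` wrapper), and the temporary file name `best_pipeline.pkl` are all assumptions made up for illustration, not taken from the question.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Toy corpus and labels, invented purely for illustration.
docs = [
    "cheap pills buy now", "win money now",
    "buy cheap watches", "free money offer",
    "meeting at noon", "project status report",
    "lunch with the team", "quarterly report due",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", SVC(kernel="linear")),
])

# A deliberately small grid so the sketch runs quickly.
param_grid = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1, 10],
}

search = GridSearchCV(pipeline, param_grid, cv=2)
search.fit(docs, labels)

# With refit=True (the default), best_estimator_ is the full pipeline
# refit on all of the data using the best parameter combination.
best_pipeline = search.best_estimator_

# Hypothetical file location inside a temporary directory.
model_path = os.path.join(tempfile.mkdtemp(), "best_pipeline.pkl")

# Pickle files must be written and read in binary mode.
with open(model_path, "wb") as fp:
    pickle.dump(best_pipeline, fp)

with open(model_path, "rb") as fp:
    loaded_pipeline = pickle.load(fp)

# The loaded pipeline accepts raw text: it vectorizes, applies TF-IDF,
# and classifies in a single call.
new_docs = ["buy cheap money", "team status meeting"]
predictions = loaded_pipeline.predict(new_docs)
print(predictions)
```

Keeping the whole `GridSearchCV` object, as in the answer above, is still useful if you want `cv_results_` or `best_params_` later. Either way, load the file with the same scikit-learn version that wrote it, and only unpickle files you trust, since unpickling can execute arbitrary code.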