Challenges finding the Mean Absolute Error (MAE) when using Pipeline and GridSearchCV
Background:
I am working on a data science project (minimal working example below) in which the classifier's performance metric is reported as an MAE value.
#Library
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

#Data import and preparation
data = pd.read_csv("data.csv")
data_features = ['location','event_type_count','log_feature_count','total_volume','resource_type_count','severity_type']
X = data[data_features]
y = data.fault_severity

#Train Validation Split for Cross Validation
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=0)

#RandomForest Modeling
RF_model = RandomForestClassifier(n_estimators=100, random_state=0)
RF_model.fit(X_train, y_train)

#RandomForest Prediction
y_predict = RF_model.predict(X_valid)

#MAE
print(mean_absolute_error(y_valid, y_predict))
# Output:
# 0.38727149627623564
Challenge:
Now I am trying to achieve the same thing using Pipeline and GridSearchCV (minimal working example below), expecting to get the same MAE value as above. Unfortunately, none of the three approaches below has worked.
#Library
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

#Data import and preparation
data = pd.read_csv("data.csv")
data_features = ['location','event_type_count','log_feature_count','total_volume','resource_type_count','severity_type']
X = data[data_features]
y = data.fault_severity

#Train Validation Split for Cross Validation
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=0)

#RandomForest Modeling via Pipeline and Hyper-parameter tuning
steps = [('rf', RandomForestClassifier(random_state=0))]
pipeline = Pipeline(steps) # define the pipeline object
parameters = {'rf__n_estimators': [100]}
grid = GridSearchCV(pipeline, param_grid=parameters, scoring='neg_mean_squared_error', cv=None, refit=True)
grid.fit(X_train, y_train)

#Approach 1:
print(grid.best_score_)
# Output:
# -0.508130081300813

#Approach 2:
y_predict = grid.predict(X_valid)
print("score = %3.2f" % (grid.score(y_predict, y_valid)))
# Output:
# ValueError: Expected 2D array, got 1D array instead:
# array=[0. 0. 0. ... 0. 1. 0.].
# Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

#Approach 3:
y_predict_df = pd.DataFrame(y_predict.reshape(len(y_predict), -1), columns=['fault_severity'])
print("score = %3.2f" % (grid.score(y_predict_df, y_valid)))
# Output:
# ValueError: Number of features of the model must match the input. Model n_features is 6 and input n_features is 1
Discussion:
Approach 1: The scoring parameter in GridSearchCV() is set to neg_mean_squared_error and I read grid.best_score_. It does not give the same MAE result.
Approach 2: Tried grid.predict(X_valid) to obtain the y_predict values, then tried grid.score(y_predict, y_valid) to get the MAE, since the scoring parameter in GridSearchCV() is set to neg_mean_squared_error. It returns a ValueError complaining "Expected 2D array, got 1D array instead".
Approach 3: Tried reshaping y_predict, but that did not work either. This time it returns "ValueError: Number of features of the model must match the input."
It would be very helpful if you could point out where I might be going wrong.
If needed, data.csv is available at https://www.dropbox.com/s/t1h53jg1hy4x33b/data.csv
Many thanks
Answer:
You are comparing mean_absolute_error with neg_mean_squared_error, which are very different metrics; see here for more details. You should use neg_mean_absolute_error when creating the GridSearchCV object, like this:

grid = GridSearchCV(pipeline, param_grid=parameters, scoring='neg_mean_absolute_error', cv=None, refit=True)
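With that change, grid.best_score_ holds the negated cross-validated MAE of the best parameter setting. Because it is averaged over the CV folds of X_train, it is comparable in meaning, but not necessarily numerically identical, to the hold-out MAE from your first example. A minimal sketch, reusing the pipeline, parameters, X_train and y_train defined above:

grid = GridSearchCV(pipeline, param_grid=parameters, scoring='neg_mean_absolute_error', cv=None, refit=True)
grid.fit(X_train, y_train)

# best_score_ is the negative mean cross-validated MAE; flip the sign to get a positive MAE
cv_mae = -grid.best_score_
print(cv_mae)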
Also, the score method in sklearn expects (X, y) as input, where X are the input features with shape (n_samples, n_features) and y are the target labels. You need to change grid.score(y_predict, y_valid) to grid.score(X_valid, y_valid).
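To reproduce the hold-out MAE from your first example, a sketch along these lines should work, since refit=True retrains the best estimator (here RandomForestClassifier(n_estimators=100, random_state=0)) on the full X_train:

from sklearn.metrics import mean_absolute_error

# Predictions come from the refit best_estimator_
y_predict = grid.predict(X_valid)
print(mean_absolute_error(y_valid, y_predict))

# Equivalent via score(): negate the result, because the scorer is neg_mean_absolute_error
print(-grid.score(X_valid, y_valid))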