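Scikit-learn's IterativeImputer can fill in missing values in a round-robin fashion. To compare its performance against other, more traditional regressors, one could build a simple pipeline and get scoring metrics from cross_val_score. The problem is that IterativeImputer has no 'predict' method, so the call fails with this error: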
AttributeError: 'IterativeImputer' object has no attribute 'predict'
Here is a simple example of what was attempted:
    # import libraries
    import pandas as pd
    from sklearn.experimental import enable_iterative_imputer
    from sklearn.impute import IterativeImputer
    from sklearn.preprocessing import StandardScaler
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import Pipeline

    # define scaler, model and pipeline
    scaler = StandardScaler()  # use any scaler
    imputer = IterativeImputer()  # with any estimator, default = BayesianRidge()
    pipeline = Pipeline(steps=[('s', scaler), ('i', imputer)])

    train, test = df.values, df['A'].values
    scores = cross_val_score(pipeline, train, test, cv=10, scoring='r2')
    print(scores)
What are the possible solutions? If a custom wrapper class is needed, how should it be written so that it includes a 'predict' method?
Answer:
cross_val_score needs the pipeline to end with a model, that is, an estimator that has a predict method:
    from sklearn.linear_model import BayesianRidge

    scaler = StandardScaler()
    imputer = IterativeImputer()
    model = BayesianRidge()  # any model
    pipeline = Pipeline(steps=[('s', scaler), ('i', imputer), ('m', model)])
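Calling cross_val_score without a model makes no sense: a metric such as r2 compares predictions against y, and an imputer alone produces no predictions.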
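To make the point above concrete, here is a minimal sketch that checks which pipelines expose predict. The variable names imputer_only and with_model are made up for illustration, and the behaviour shown assumes a reasonably recent scikit-learn release, where Pipeline only exposes predict when its final step has one.

```python
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Pipeline that ends with a transformer (the setup from the question)
imputer_only = Pipeline(steps=[('s', StandardScaler()), ('i', IterativeImputer())])

# Pipeline that ends with a regressor (the setup from the answer)
with_model = Pipeline(steps=[
    ('s', StandardScaler()),
    ('i', IterativeImputer()),
    ('m', BayesianRidge()),
])

# Pipeline delegates predict to its last step, so only the second one has it
print(hasattr(imputer_only, 'predict'))
print(hasattr(with_model, 'predict'))
```

Since a scorer such as 'r2' has to call predict on the fitted estimator, only the second pipeline can be scored this way.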
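I also see another problem: the train, test values you pass to cross_val_score.
They should be called X, y rather than train, test, but that is only naming, so it is not that important. What matters is the values you assign to these variables.
The problem is that X should not contain y, but you used train = df.values, so the X you created includes y. Split them instead: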
    df_train = pd.DataFrame({
        'X': range(20),
        'y': range(20),
    })

    X_train = df_train[['X']]  # it needs inner `[]` to create DataFrame, not Series
    y_train = df_train['y']    # it has to be single column (Series)

    scores = cross_val_score(pipeline, X_train, y_train, cv=10, scoring='r2')
(By the way: you don't need to use .values)
The same applies when there are more columns:
    df_train = pd.DataFrame({
        'A': range(20),
        'B': range(20),
        'y': range(20),
    })

    X_train = df_train[['A', 'B']]
    y_train = df_train['y']
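When the frame has many feature columns, listing them by hand gets tedious. One possible alternative, sketched here with the same fake column names, is to drop the target column and keep everything else as X. This is just a convenience, not something the original answer prescribes.

```python
import pandas as pd

df_train = pd.DataFrame({
    'A': range(20),  # fake data
    'B': range(20),  # fake data
    'y': range(20),  # fake data
})

# Everything except the target becomes the feature matrix
X_train = df_train.drop(columns=['y'])
# The target stays a single column (Series)
y_train = df_train['y']

print(list(X_train.columns))
print(y_train.name)
```

Either way, the point is the same: the target must not appear among the features, otherwise the model sees the answer it is supposed to predict.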
Minimal working code, though it uses meaningless fake data:
    # import libraries
    import pandas as pd
    from sklearn.experimental import enable_iterative_imputer
    from sklearn.impute import IterativeImputer
    from sklearn.preprocessing import StandardScaler
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import Pipeline
    from sklearn.linear_model import BayesianRidge

    df_train = pd.DataFrame({
        'A': range(100),  # fake data
        'B': range(100),  # fake data
        'y': range(100),  # fake data
    })

    df_test = pd.DataFrame({
        'A': range(20),  # fake data
        'B': range(20),  # fake data
        'y': range(20),  # fake data
    })

    # define scaler, model and pipeline
    scaler = StandardScaler()
    imputer = IterativeImputer()
    model = BayesianRidge()
    pipeline = Pipeline(steps=[('s', scaler), ('i', imputer), ('m', model)])

    X_train = df_train[['A', 'B']]  # it needs inner `[]` to create DataFrame, not Series
    y_train = df_train['y']         # it has to be single column (Series)

    scores = cross_val_score(pipeline, X_train, y_train, cv=10, scoring='r2')
    print(scores)

    X_test = df_test[['A', 'B']]
    y_test = df_test['y']

    scores = cross_val_score(pipeline, X_test, y_test, cv=10, scoring='r2')
    print(scores)
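The fake data above has no missing values, so the imputer has nothing to do. As a rough sketch of how the original goal (comparing imputation strategies) could be approached with this pipeline layout, the code below generates synthetic data, removes about 20% of the entries at random, and scores the same downstream model with different imputers. The data, the missing rate, the choice of DecisionTreeRegressor and the names imputers and results are all assumptions made for illustration.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer
from sklearn.linear_model import BayesianRidge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (assumption: stands in for a real dataset)
rng = np.random.default_rng(0)
X_full = rng.normal(size=(200, 3))
y = X_full @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Knock out roughly 20% of the feature values at random
X_missing = X_full.copy()
X_missing[rng.random(X_full.shape) < 0.2] = np.nan

# Candidate imputers; the final model is the same for each
imputers = {
    'mean': SimpleImputer(strategy='mean'),
    'iterative_bayesian_ridge': IterativeImputer(
        estimator=BayesianRidge(), random_state=0),
    'iterative_tree': IterativeImputer(
        estimator=DecisionTreeRegressor(max_depth=5, random_state=0),
        random_state=0),
}

results = {}
for name, imputer in imputers.items():
    pipeline = Pipeline(steps=[
        ('s', StandardScaler()),  # StandardScaler ignores NaN when fitting
        ('i', imputer),
        ('m', BayesianRidge()),   # final step provides predict
    ])
    results[name] = cross_val_score(pipeline, X_missing, y, cv=5, scoring='r2')
    print(name, results[name].mean())
```

Because each imputer sits inside the pipeline, it is refit on the training part of every fold, so the scores compare imputation strategies by how well the downstream model predicts y.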