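I am trying to build a pipeline that combines FAMD, SMOTENC, and other preprocessing steps, but it raises an error every time. If I remove FAMD from the pipeline, it works fine.
My code is as follows: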
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Separate the dataset into numeric and categorical columns
num_df = X_train_new.select_dtypes(include=[np.number]).columns
cat_df = X_train_new.select_dtypes(exclude=[np.number]).columns

# Create a mask for categorical features
categorical_feature_mask = X_train_new.dtypes == object
print(categorical_feature_mask)

from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer
from sklearn.compose import make_column_selector as selector

# Create a pipeline to automate the preprocessing steps and SMOTENC together
num_pipe = make_pipeline(SimpleImputer(strategy='median'))
cat_pipe = make_pipeline(SimpleImputer(strategy='most_frequent'), OneHotEncoder(handle_unknown='ignore'))
transformer = make_column_transformer((num_pipe, selector(dtype_include='number')), (cat_pipe, selector(dtype_include='object')), n_jobs=2)

# Oversampling with SMOTENC
from imblearn.over_sampling import SMOTENC
smote = SMOTENC(categorical_features=categorical_feature_mask, random_state=99)

!pip install prince
from prince import FAMD
famd = FAMD(n_components=4, random_state=99)

from imblearn.pipeline import make_pipeline as imb_pipeline

# Fit the random forest learner
rf = RandomForestClassifier(n_estimators=300, random_state=99)
pipe = imb_pipeline(transformer, smote, famd, rf)
pipe.fit(X_train_new, y_train_new)
print('Training Accuracy:%s' % pipe.score(X_train_new, y_train_new))
The error message is as follows:
AttributeError                            Traceback (most recent call last)
<ipython-input-24-2b7ea084a318> in <module>()
      3 rf=RandomForestClassifier(n_estimators=300,max_features=3,criterion='entropy',random_state=99)
      4 pipe=imb_pipeline(transformer,smote,famd,rf)
----> 5 pipe.fit(X_train_new,y_train_new)
      6 print('Training Accuracy:%s'%pipe.score(X_train_new,y_train_new))

6 frames
/usr/local/lib/python3.7/dist-packages/imblearn/pipeline.py in fit(self, X, y, **fit_params)
    235
    236         """
--> 237         Xt, yt, fit_params = self._fit(X, y, **fit_params)
    238         if self._final_estimator is not None:
    239             self._final_estimator.fit(Xt, yt, **fit_params)

/usr/local/lib/python3.7/dist-packages/imblearn/pipeline.py in _fit(self, X, y, **fit_params)
    195                 Xt, fitted_transformer = fit_transform_one_cached(
    196                     cloned_transformer, None, Xt, yt,
--> 197                     **fit_params_steps[name])
    198             elif hasattr(cloned_transformer, "fit_resample"):
    199                 Xt, yt, fitted_transformer = fit_resample_one_cached(

/usr/local/lib/python3.7/dist-packages/joblib/memory.py in __call__(self, *args, **kwargs)
    350
    351     def __call__(self, *args, **kwargs):
--> 352         return self.func(*args, **kwargs)
    353
    354     def call_and_shelve(self, *args, **kwargs):

/usr/local/lib/python3.7/dist-packages/imblearn/pipeline.py in _fit_transform_one(transformer, weight, X, y, **fit_params)
    564 def _fit_transform_one(transformer, weight, X, y, **fit_params):
    565     if hasattr(transformer, 'fit_transform'):
--> 566         res = transformer.fit_transform(X, y, **fit_params)
    567     else:
    568         res = transformer.fit(X, y, **fit_params).transform(X)

/usr/local/lib/python3.7/dist-packages/sklearn/base.py in fit_transform(self, X, y, **fit_params)
    572         else:
    573             # fit method of arity 2 (supervised transformation)
--> 574             return self.fit(X, y, **fit_params).transform(X)
    575
    576

/usr/local/lib/python3.7/dist-packages/prince/famd.py in fit(self, X, y)
     27
     28         # Separate numerical columns from categorical columns
---> 29         num_cols = X.select_dtypes(np.number).columns.tolist()
     30         cat_cols = list(set(X.columns) - set(num_cols))
     31

/usr/local/lib/python3.7/dist-packages/scipy/sparse/base.py in __getattr__(self, attr)
    689             return self.getnnz()
    690         else:
--> 691             raise AttributeError(attr + " not found")
    692
    693     def transpose(self, axes=None, copy=False):

AttributeError: select_dtypes not found
Answer:
In short: try adding sparse=False to your OneHotEncoder. Also consider filing an issue with prince asking it to handle sparse input.
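As a minimal sketch of that suggestion, the snippet below builds the same kind of preprocessing transformer with a dense one-hot encoder and checks that the result is a plain NumPy array. The toy DataFrame, its column names, and the helper dense_one_hot_encoder are made up for illustration and are not part of the question. The helper exists only because scikit-learn 1.2 renamed the sparse argument to sparse_output, so it picks whichever name the installed version accepts.

```python
import inspect

import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder


def dense_one_hot_encoder(**kwargs):
    """Hypothetical helper: a OneHotEncoder that returns dense arrays.

    Uses `sparse_output=False` when the installed scikit-learn has that
    parameter (1.2 and later) and falls back to `sparse=False` otherwise.
    """
    params = inspect.signature(OneHotEncoder).parameters
    dense_kwarg = "sparse_output" if "sparse_output" in params else "sparse"
    return OneHotEncoder(**{dense_kwarg: False}, **kwargs)


# Toy stand-in for X_train_new (assumed data, not from the question)
toy = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "city": pd.Series(["a", "b", "a", "c"], dtype=object),
})

num_pipe = make_pipeline(SimpleImputer(strategy="median"))
cat_pipe = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    dense_one_hot_encoder(handle_unknown="ignore"),
)
transformer = make_column_transformer((num_pipe, ["age"]), (cat_pipe, ["city"]))

Xt = transformer.fit_transform(toy)
print(type(Xt), Xt.shape, sparse.issparse(Xt))
```

With every component returning dense output, the transformer hands a NumPy array to the next pipeline step instead of a scipy sparse matrix.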
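As the traceback shows, the problem is that FAMD.fit tries to use X.select_dtypes to separate the categorical and numerical data. select_dtypes is a pandas method, so I would normally assume that prince is designed to operate on dataframes rather than on the numpy arrays that sklearn uses internally (converting from dataframes where necessary). Looking at the source, however, a few lines above that call they do convert numpy arrays to dataframes. But the last frame of the traceback comes from scipy, which suggests that your X is actually a sparse array. Indeed, OneHotEncoder (earlier in your pipeline) outputs sparse arrays by default, and ColumnTransformer decides whether to return sparse or dense output based on its components and its sparse_threshold parameter.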
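To make the role of sparse_threshold concrete, here is a small sketch on made-up data (the DataFrame, the column names, and the build helper are assumptions for illustration). The categorical column has ten levels, so the stacked output is mostly zeros and its density falls below the default threshold of 0.3. Setting sparse_threshold=0 is the documented way to always get dense output from ColumnTransformer.

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

# Toy data (hypothetical): one numeric column and one categorical column
# with 10 distinct levels, so the one-hot block is mostly zeros.
toy = pd.DataFrame({
    "x": np.arange(20, dtype=float),
    "cat": pd.Series([f"c{i % 10}" for i in range(20)], dtype=object),
})


def build(sparse_threshold):
    """Hypothetical helper: same transformer, different sparse_threshold."""
    return make_column_transformer(
        ("passthrough", ["x"]),
        (OneHotEncoder(handle_unknown="ignore"), ["cat"]),
        sparse_threshold=sparse_threshold,
    )


Xt_default = build(0.3).fit_transform(toy)  # 0.3 is the default threshold
Xt_dense = build(0).fit_transform(toy)      # 0 always returns dense output

print("default threshold -> sparse?", sparse.issparse(Xt_default))
print("threshold 0       -> sparse?", sparse.issparse(Xt_dense))

# A scipy sparse matrix has no pandas methods, which is what FAMD.fit trips on.
print("has select_dtypes?", hasattr(Xt_default, "select_dtypes"))
```

Running a check like sparse.issparse on the output of your own transformer is a quick way to confirm whether the step before FAMD is handing it a sparse matrix.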