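AttributeError when implementing an imblearn pipeline with FAMD and SMOTENC

I am trying to build a pipeline that combines FAMD, SMOTENC, and other preprocessing steps, but it fails with an error every time. If I remove FAMD from the pipeline, it works fine.

My code is as follows: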

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import OneHotEncoder

    # Separate the dataset into numerical and categorical columns
    num_df = X_train_new.select_dtypes(include=[np.number]).columns
    cat_df = X_train_new.select_dtypes(exclude=[np.number]).columns

    # Create a mask for categorical features
    categorical_feature_mask = X_train_new.dtypes == object
    print(categorical_feature_mask)

    from sklearn.pipeline import make_pipeline
    from sklearn.compose import make_column_transformer
    from sklearn.compose import make_column_selector as selector

    # Create a pipeline to automate the preprocessing steps and SMOTENC together
    num_pipe = make_pipeline(SimpleImputer(strategy='median'))
    cat_pipe = make_pipeline(SimpleImputer(strategy='most_frequent'),
                             OneHotEncoder(handle_unknown='ignore'))
    transformer = make_column_transformer((num_pipe, selector(dtype_include='number')),
                                          (cat_pipe, selector(dtype_include='object')),
                                          n_jobs=2)

    # Oversampling with SMOTENC
    from imblearn.over_sampling import SMOTENC
    smote = SMOTENC(categorical_features=categorical_feature_mask, random_state=99)

    !pip install prince
    from prince import FAMD
    famd = FAMD(n_components=4, random_state=99)

    from imblearn.pipeline import make_pipeline as imb_pipeline

    # Fit the random forest learner
    rf = RandomForestClassifier(n_estimators=300, random_state=99)
    pipe = imb_pipeline(transformer, smote, famd, rf)
    pipe.fit(X_train_new, y_train_new)
    print('Training Accuracy:%s' % pipe.score(X_train_new, y_train_new))

The error message is:

    AttributeError                            Traceback (most recent call last)
    <ipython-input-24-2b7ea084a318> in <module>()
          3 rf=RandomForestClassifier(n_estimators=300,max_features=3,criterion='entropy',random_state=99)
          4 pipe=imb_pipeline(transformer,smote,famd,rf)
    ----> 5 pipe.fit(X_train_new,y_train_new)
          6 print('Training Accuracy:%s'%pipe.score(X_train_new,y_train_new))

    6 frames
    /usr/local/lib/python3.7/dist-packages/imblearn/pipeline.py in fit(self, X, y, **fit_params)
        235
        236         """
    --> 237         Xt, yt, fit_params = self._fit(X, y, **fit_params)
        238         if self._final_estimator is not None:
        239             self._final_estimator.fit(Xt, yt, **fit_params)

    /usr/local/lib/python3.7/dist-packages/imblearn/pipeline.py in _fit(self, X, y, **fit_params)
        195                     Xt, fitted_transformer = fit_transform_one_cached(
        196                         cloned_transformer, None, Xt, yt,
    --> 197                         **fit_params_steps[name])
        198                 elif hasattr(cloned_transformer, "fit_resample"):
        199                     Xt, yt, fitted_transformer = fit_resample_one_cached(

    /usr/local/lib/python3.7/dist-packages/joblib/memory.py in __call__(self, *args, **kwargs)
        350
        351     def __call__(self, *args, **kwargs):
    --> 352         return self.func(*args, **kwargs)
        353
        354     def call_and_shelve(self, *args, **kwargs):

    /usr/local/lib/python3.7/dist-packages/imblearn/pipeline.py in _fit_transform_one(transformer, weight, X, y, **fit_params)
        564 def _fit_transform_one(transformer, weight, X, y, **fit_params):
        565     if hasattr(transformer, 'fit_transform'):
    --> 566         res = transformer.fit_transform(X, y, **fit_params)
        567     else:
        568         res = transformer.fit(X, y, **fit_params).transform(X)

    /usr/local/lib/python3.7/dist-packages/sklearn/base.py in fit_transform(self, X, y, **fit_params)
        572         else:
        573             # fit method of arity 2 (supervised transformation)
    --> 574             return self.fit(X, y, **fit_params).transform(X)
        575
        576

    /usr/local/lib/python3.7/dist-packages/prince/famd.py in fit(self, X, y)
         27
         28         # Separate numerical columns from categorical columns
    ---> 29         num_cols = X.select_dtypes(np.number).columns.tolist()
         30         cat_cols = list(set(X.columns) - set(num_cols))
         31

    /usr/local/lib/python3.7/dist-packages/scipy/sparse/base.py in __getattr__(self, attr)
        689             return self.getnnz()
        690         else:
    --> 691             raise AttributeError(attr + " not found")
        692
        693     def transpose(self, axes=None, copy=False):

    AttributeError: select_dtypes not found

Answer:

TL;DR: try adding sparse=False to your OneHotEncoder. Consider opening an issue with prince asking it to handle sparse input.
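A minimal sketch of that change, in case it helps. The helper name make_dense_one_hot and the toy "color" column are hypothetical, not from the question. The helper exists only because scikit-learn 1.2 renamed the sparse parameter to sparse_output (the old name was removed in 1.4); on the Python 3.7 environment shown in the traceback, plain sparse=False should be what applies.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder


def make_dense_one_hot(**kwargs):
    """Hypothetical helper: build a OneHotEncoder that returns a dense array."""
    try:
        # scikit-learn >= 1.2 calls the parameter sparse_output
        return OneHotEncoder(sparse_output=False, **kwargs)
    except TypeError:
        # older releases, such as the one in the question, call it sparse
        return OneHotEncoder(sparse=False, **kwargs)


# Toy categorical data, made up for illustration
X_cat = pd.DataFrame({"color": ["red", "blue", "red", "green"]})

encoder = make_dense_one_hot(handle_unknown="ignore")
encoded = encoder.fit_transform(X_cat)

print(type(encoded))   # a numpy.ndarray rather than a scipy sparse matrix
print(encoded.shape)   # one column per distinct category
```

In the question's pipeline, this encoder would take the place of OneHotEncoder(handle_unknown='ignore') inside cat_pipe.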

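As the traceback shows, the problem is that FAMD.fit tries to use X.select_dtypes to separate categorical from numerical data. select_dtypes is a pandas method, so I would normally assume that prince is designed to operate on data frames rather than on the numpy arrays that sklearn uses internally (converting from data frames where needed). Looking at the source, however, a few lines above that call they do convert numpy arrays to data frames.

The last frame of the traceback, though, comes from scipy, which suggests that your X is actually a sparse matrix. Indeed, OneHotEncoder (earlier in your pipeline) outputs sparse matrices by default, and ColumnTransformer decides whether to return sparse or dense output based on its components and its sparse_threshold parameter.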
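The sparse/dense decision can be checked on a small example. This is only a sketch on made-up data: the column names "num" and "cat" and the helper build are assumptions for illustration. The categorical column has ten levels so that the stacked output is sparse enough to fall below the default sparse_threshold of 0.3. Setting sparse_threshold=0 is an alternative way to force dense output that leaves the encoder itself untouched.

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Toy data (hypothetical): one numeric column, one categorical with 10 levels
df = pd.DataFrame({
    "num": np.arange(1.0, 21.0),
    "cat": [f"c{i % 10}" for i in range(20)],
})


def build(sparse_threshold):
    """Hypothetical helper: a transformer shaped like the one in the question."""
    return make_column_transformer(
        (SimpleImputer(strategy="median"), ["num"]),
        (OneHotEncoder(handle_unknown="ignore"), ["cat"]),
        sparse_threshold=sparse_threshold,
    )


X_default = build(0.3).fit_transform(df)  # 0.3 is the documented default
X_dense = build(0).fit_transform(df)      # 0 means always return dense

print(sparse.issparse(X_default))             # sparse: low-density output
print(hasattr(X_default, "select_dtypes"))    # no pandas methods on it
print(sparse.issparse(X_dense), X_dense.shape)
```

The first result is the kind of object that reaches FAMD.fit in the question, and it has no select_dtypes. The second is a plain numpy array, which, per the reading of the prince source above, should get converted to a data frame before that call; that step is not exercised here.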
