Using multiple custom classes with a sklearn Pipeline (Python)

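I've hit a roadblock while preparing a Pipeline tutorial for my students. I'm not an expert, but I'm working on improving, so thank you for your patience. I'm trying to run several steps in one pipeline to prepare a data frame for a classifier:

Step 1: Describe the data frame
Step 2: Fill NaN values
Step 3: Convert categorical values to numbers

Here is my code: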

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder


class Descr_df(object):

    def transform(self, X):
        print("Structure of the data: \n {}".format(X.head(5)))
        print("Features names: \n {}".format(X.columns))
        print("Target: \n {}".format(X.columns[0]))
        print("Shape of the data: \n {}".format(X.shape))

    def fit(self, X, y=None):
        return self


class Fillna(object):

    def transform(self, X):
        non_numerics_columns = X.columns.difference(X._get_numeric_data().columns)
        for column in X.columns:
            if column in non_numerics_columns:
                # Non-numeric column: fill with the most frequent value
                # (the original referenced an undefined global `df` here)
                X[column] = X[column].fillna(X[column].value_counts().idxmax())
            else:
                # Numeric column: fill with the column mean
                X[column] = X[column].fillna(X[column].mean())
        return X

    def fit(self, X, y=None):
        return self


class Categorical_to_numerical(object):

    def transform(self, X):
        non_numerics_columns = X.columns.difference(X._get_numeric_data().columns)
        le = LabelEncoder()
        for column in non_numerics_columns:
            X[column] = X[column].fillna(X[column].value_counts().idxmax())
            le.fit(X[column])
            X[column] = le.transform(X[column]).astype(int)
        return X

    def fit(self, X, y=None):
        return self

If I run steps 1 and 2, or steps 1 and 3, everything works. But if I run steps 1, 2 and 3 together, I get this error:

pipeline = Pipeline([
    ('df_intropesction', Descr_df()),
    ('fillna', Fillna()),
    ('Categorical_to_numerical', Categorical_to_numerical()),
])
pipeline.fit(X, y)

AttributeError: 'NoneType' object has no attribute 'columns'

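Answer:

The error occurs because, in a Pipeline, the output of the first estimator is passed to the second, the output of the second is passed to the third, and so on.

From the Pipeline documentation:

Fit all the transforms one after the other and transform the data, then fit the transformed data using the final estimator.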

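So for your pipeline, the execution goes as follows:

1. Descr_df.fit(X) -> does nothing and returns self
2. newX = Descr_df.transform(X) -> should return a value to assign to newX, which is then passed to the next estimator. Your definition returns nothing (it only prints), so None is returned implicitly
3. Fillna.fit(newX) -> does nothing and returns self
4. Fillna.transform(newX) -> accesses newX.columns, but newX is None from step 2, hence the error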
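The sequence above can be reproduced without a Pipeline at all. The following is only a minimal sketch: `DescribeOnly` and `NeedsColumns` are hypothetical stand-ins for `Descr_df` and `Fillna`, and the tiny data frame is made up for illustration.

```python
import pandas as pd


class DescribeOnly:
    """Stand-in for Descr_df: prints information but forgets to return X."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        print("Shape of the data: {}".format(X.shape))
        # No return statement here, so Python implicitly returns None.


class NeedsColumns:
    """Stand-in for Fillna: its transform reads X.columns."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return list(X.columns)


# Hypothetical toy data, not taken from the original question.
X = pd.DataFrame({"age": [20.0, None], "city": ["Paris", None]})

# Roughly what the pipeline does for the first step: fit, then transform.
new_X = DescribeOnly().fit(X).transform(X)
print(new_X)  # None

# The next step now receives None instead of a data frame.
error_message = ""
try:
    NeedsColumns().fit(new_X).transform(new_X)
except AttributeError as exc:
    error_message = str(exc)
    print(error_message)
```

This also suggests why steps 1 and 2 (or 1 and 3) alone appeared to work: with only two steps, the second one is the final estimator, and `pipeline.fit` only calls `fit` on it, which never touches the data.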

Solution: change the transform method of Descr_df so that it returns the data frame unchanged:

def transform(self, X):
    print("Structure of the data: \n {}".format(X.head(5)))
    print("Features names: \n {}".format(X.columns))
    print("Target: \n {}".format(X.columns[0]))
    print("Shape of the data: \n {}".format(X.shape))
    return X

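Suggestion: make your classes inherit from scikit-learn's BaseEstimator and TransformerMixin classes (both in sklearn.base) to follow best practice.

That is, change class Descr_df(object) to class Descr_df(BaseEstimator, TransformerMixin), change Fillna(object) to Fillna(BaseEstimator, TransformerMixin), and so on.

For more details on custom classes in a Pipeline, see this example: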
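Putting the fix and the suggestion together, the three steps might look like the sketch below. It is not the original code verbatim: `select_dtypes` replaces the private `_get_numeric_data()` call, each `transform` works on a copy so the caller's data frame is left untouched, the step names are shortened, and the toy data is hypothetical.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder


class Descr_df(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        print("Features names: \n {}".format(X.columns))
        print("Shape of the data: \n {}".format(X.shape))
        return X  # pass the data frame on to the next step


class Fillna(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()  # assumption: we prefer not to mutate the input
        numeric_columns = X.select_dtypes(include="number").columns
        for column in X.columns:
            if column in numeric_columns:
                X[column] = X[column].fillna(X[column].mean())
            else:
                X[column] = X[column].fillna(X[column].value_counts().idxmax())
        return X


class Categorical_to_numerical(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()
        non_numeric_columns = X.select_dtypes(exclude="number").columns
        for column in non_numeric_columns:
            X[column] = LabelEncoder().fit_transform(X[column]).astype(int)
        return X


# Hypothetical toy data, not taken from the original question.
X = pd.DataFrame({
    "age": [20.0, None, 40.0],
    "city": ["Paris", None, "Paris"],
})

pipeline = Pipeline([
    ("describe", Descr_df()),
    ("fillna", Fillna()),
    ("categorical_to_numerical", Categorical_to_numerical()),
])

# fit_transform on the last step is provided by TransformerMixin.
result = pipeline.fit_transform(X)
print(result)
```

Inheriting from TransformerMixin gives each class a `fit_transform` method for free, and BaseEstimator adds `get_params` and `set_params`, which tools such as grid search rely on.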

http://scikit-learn.org/stable/auto_examples/hetero_feature_union.html#sphx-glr-auto-examples-hetero-feature-union-py
