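Using metrics after the classifier in a Pipeline

I am continuing to explore pipelines. My goal is to carry out every step of the machine learning workflow through a pipeline alone. That would be more flexible and would make it easier to adapt my pipeline to other use cases. These are the steps I have so far:

  • Step 1: fill NaN values
  • Step 2: convert categorical values to numbers
  • Step 3: classifier
  • Step 4: grid search
  • Step 5: add metrics (fails)

Here is my code: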

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.metrics import roc_curve, auc
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder


class FillNa(BaseEstimator, TransformerMixin):

    def transform(self, x, y=None):
        non_numerics_columns = x.columns.difference(
            x._get_numeric_data().columns)
        for column in x.columns:
            if column in non_numerics_columns:
                # Non-numeric column: fill with its most frequent value
                # (taken from x itself rather than the global df).
                x.loc[:, column] = x.loc[:, column].fillna(
                    x[column].value_counts().idxmax())
            else:
                # Numeric column: fill with its mean.
                x.loc[:, column] = x.loc[:, column].fillna(
                    x.loc[:, column].mean())
        return x

    def fit(self, x, y=None):
        return self


class CategoricalToNumerical(BaseEstimator, TransformerMixin):

    def transform(self, x, y=None):
        non_numerics_columns = x.columns.difference(
            x._get_numeric_data().columns)
        le = LabelEncoder()
        for column in non_numerics_columns:
            x.loc[:, column] = x.loc[:, column].fillna(
                x.loc[:, column].value_counts().idxmax())
            le.fit(x.loc[:, column])
            x.loc[:, column] = le.transform(x.loc[:, column]).astype(int)
        return x

    def fit(self, x, y=None):
        return self


class Perf(BaseEstimator, TransformerMixin):

    def fit(self, clf, x, y, perf="all"):
        """Only for classifier model.

        Return AUC, ROC, Confusion Matrix and F1 score from a classifier and df.
        perf is a comma-separated string of evals.
        Example: perf='auc,roc,cm,f1' will return these 4 evals;
        perf='all' returns all of them.
        """
        evals = {}
        y_pred_proba = clf.predict_proba(x)[:, 1]
        y_pred = clf.predict(x)
        perf_list = perf.split(',')
        # ("all" or "roc") evaluates to "all", so test each name explicitly.
        if "all" in perf_list or "roc" in perf_list:
            fpr, tpr, _ = roc_curve(y, y_pred_proba)
            roc_auc = round(auc(fpr, tpr), 3)
            plt.style.use('bmh')
            plt.figure(figsize=(12, 9))
            plt.title('ROC Curve')
            plt.plot(fpr, tpr, 'b',
                     label='AUC = {}'.format(roc_auc))
            plt.legend(loc='lower right', borderpad=1, labelspacing=1,
                       prop={"size": 12}, facecolor='white')
            plt.plot([0, 1], [0, 1], 'r--')
            plt.xlim([-0.1, 1.])
            plt.ylim([-0.1, 1.])
            plt.ylabel('True Positive Rate')
            plt.xlabel('False Positive Rate')
            plt.show()
        if "all" in perf_list or "auc" in perf_list:
            fpr, tpr, _ = roc_curve(y, y_pred_proba)
            evals['auc'] = auc(fpr, tpr)
        if "all" in perf_list or "cm" in perf_list:
            evals['cm'] = confusion_matrix(y, y_pred)
        if "all" in perf_list or "f1" in perf_list:
            evals['f1'] = f1_score(y, y_pred)
        return evals


path = '~/proj/akd-doc/notebooks/data/'
df = pd.read_csv(path + 'titanic_tuto.csv', sep=';')
y = df.pop('Survival-Status').replace(to_replace=['dead', 'alive'],
                                      value=[0., 1.])
X = df.copy()
X_train, X_test, y_train, y_test = train_test_split(
    X.copy(), y.copy(), test_size=0.2, random_state=42)

percent = 0.50
nb_features = round(percent * df.shape[1]) + 1

clf = RandomForestClassifier()
pipeline = Pipeline([('fillna', FillNa()),
                     ('categorical_to_numerical', CategoricalToNumerical()),
                     ('features_selection', SelectKBest(k=nb_features)),
                     ('random_forest', clf),
                     ('perf', Perf())])

params = dict(random_forest__max_depth=list(range(8, 12)),
              random_forest__n_estimators=list(range(30, 110, 10)))
cv = GridSearchCV(pipeline, param_grid=params)
cv.fit(X_train, y_train)

I know that plotting the ROC curve from inside the class is not ideal, but that is not the problem right now.

When I run this code, I get:

TypeError: If no scoring is specified, the estimator passed should have a 'score' method. The estimator Pipeline(steps=[('fillna', FillNa()), ('categorical_to_numerical', CategoricalToNumerical()), ('features_selection', SelectKBest(k=10, score_func=<function f_classif at 0x7f4ed4c3eae8>)), ('random_forest', RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',            max_depth=None,...=1, oob_score=False, random_state=None,            verbose=0, warm_start=False)), ('perf', Perf())]) does not.

I am interested in any ideas…


Answer:

As the error message says, you need to specify the scoring parameter in GridSearchCV.

Use

cv = GridSearchCV(pipeline, param_grid=params, scoring='accuracy')
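A minimal sketch of passing scoring to GridSearchCV, assuming a recent scikit-learn. The synthetic data from make_classification, the reduced parameter grid, and the variable names search and multi are illustrative stand-ins, not the asker's Titanic setup, and the pipeline here ends with the classifier. The second search shows that scoring can also be a dict of named scorers, in which case refit has to name the metric used to select the best parameters.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic binary classification data (assumption, for illustration only).
X, y = make_classification(n_samples=200, n_features=10, random_state=42)

pipeline = Pipeline([
    ("features_selection", SelectKBest(k=5)),
    ("random_forest", RandomForestClassifier(n_estimators=20, random_state=42)),
])
params = {"random_forest__max_depth": [2, 4]}

# Single metric: every parameter combination is ranked by accuracy.
search = GridSearchCV(pipeline, param_grid=params, scoring="accuracy", cv=3)
search.fit(X, y)
print(search.best_params_, search.best_score_)

# Several metrics at once: results are stored per metric name,
# and refit="auc" picks the best parameters by ROC AUC.
multi = GridSearchCV(pipeline, param_grid=params,
                     scoring={"auc": "roc_auc", "f1": "f1"},
                     refit="auc", cv=3)
multi.fit(X, y)
print(multi.cv_results_["mean_test_auc"])
print(multi.cv_results_["mean_test_f1"])
```

The per-metric cross-validation scores end up in cv_results_ under keys such as mean_test_auc and mean_test_f1, which covers AUC and F1 per split without a custom pipeline step.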

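Edit (based on the question in the comments):

If you need the ROC curve, AUC and F1 for the whole of X_train and y_train (rather than for each split made by GridSearchCV), it is better to keep the Perf class out of the pipeline.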

pipeline = Pipeline([('fillna', FillNa()),
                     ('categorical_to_numerical', CategoricalToNumerical()),
                     ('features_selection', SelectKBest(k=nb_features)),
                     ('random_forest', clf)])

# Fit the data in the pipeline
pipeline.fit(X_train, y_train)

performance_meas = Perf()
performance_meas.fit(pipeline, X_train, y_train)
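The same idea can be sketched without a custom class by calling the sklearn.metrics functions directly on a fitted pipeline. This is only an illustration under assumptions: the data is synthetic, the pipeline is shortened to two steps, and the metrics are computed on a held-out test split rather than on the training data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.metrics import confusion_matrix, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Synthetic binary classification data (assumption, for illustration only).
X, y = make_classification(n_samples=300, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# The pipeline ends with the classifier; metrics stay outside of it.
pipeline = Pipeline([
    ("features_selection", SelectKBest(k=5)),
    ("random_forest", RandomForestClassifier(n_estimators=20, random_state=42)),
])
pipeline.fit(X_train, y_train)

# Hard predictions for F1 and the confusion matrix,
# positive-class probabilities for ROC AUC.
y_pred = pipeline.predict(X_test)
y_proba = pipeline.predict_proba(X_test)[:, 1]

evals = {
    "auc": roc_auc_score(y_test, y_proba),
    "cm": confusion_matrix(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
}
print(evals)
```

If the pipeline was tuned with GridSearchCV and refit is left enabled, the fitted search object's best_estimator_ can be used in place of pipeline here.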
