How do I preprocess labels in a pipeline in sklearn?

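I have a preprocessing script that loads the diamonds dataset and preprocesses it. Obviously, I need it to preprocess the labels as well.

Here is my code: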

    # Data Preprocessing
    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.impute import SimpleImputer
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler, OneHotEncoder
    from icecream import ic


    def diamond_preprocess(data_dir):
        data = pd.read_csv(data_dir)
        cleaned_data = data.drop(['id', 'depth_percent'], axis=1)  # Features I don't want
        x = cleaned_data.drop(['price'], axis=1)  # Train data
        y = cleaned_data['price']  # Label data
        x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=99)

        numerical_features = x_train.select_dtypes(include=['int64', 'float64']).columns.tolist()
        categorical_features = x_train.select_dtypes(include=['object']).columns.tolist()

        numerical_transformer = Pipeline(steps=[
            ('imputer', SimpleImputer(strategy='median')),  # Fill in missing data with median
            ('scaler', StandardScaler())  # Scale data
        ])
        categorical_transformer = Pipeline(steps=[
            ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),  # Fill in missing data with 'missing'
            ('onehot', OneHotEncoder(handle_unknown='ignore'))  # One hot encode categorical data
        ])
        preprocessor_pipeline = ColumnTransformer(
            transformers=[
                ('num', numerical_transformer, numerical_features),
                ('cat', categorical_transformer, categorical_features)
            ])

        # Fit to the training data
        preprocessor_pipeline.fit(x_train)
        preprocessor_pipeline.fit(y_train)

        # Apply the pipeline to the training and test data
        x_train_pipe = preprocessor_pipeline.transform(x_train)
        x_test_pipe = preprocessor_pipeline.transform(x_test)
        y_train_pipe = preprocessor_pipeline.transform(y_train)
        y_test_pipe = preprocessor_pipeline.transform(y_test)

        x_train = pd.DataFrame(data=x_train_pipe)
        x_test = pd.DataFrame(data=x_test_pipe)
        y_train = pd.DataFrame(data=y_train_pipe)
        y_test = pd.DataFrame(data=y_test_pipe)
        return x_train, x_test, y_train, y_test

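I'm not confident that my code is correct, or that I understand pipelines and preprocessing in sklearn well enough. Apparently the interpreter agrees with me, because I get this error: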

      File "C:\Users\17574\Anaconda3\envs\kraken-gpu\lib\site-packages\sklearn\compose\_column_transformer.py", line 470, in fit
        self.fit_transform(X, y=y)
      File "C:\Users\17574\Anaconda3\envs\kraken-gpu\lib\site-packages\sklearn\compose\_column_transformer.py", line 502, in fit_transform
        self._check_n_features(X, reset=True)
      File "C:\Users\17574\Anaconda3\envs\kraken-gpu\lib\site-packages\sklearn\base.py", line 352, in _check_n_features
        n_features = X.shape[1]
    IndexError: tuple index out of range

How do I correctly preprocess my labels the same way I preprocess my training data? An explanation would be appreciated!


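Answer:

If you want to apply the transformations separately, you can create an additional pipeline for the target column; see the example below.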

    import pandas as pd
    import numpy as np
    from sklearn.compose import ColumnTransformer
    from sklearn.impute import SimpleImputer
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler, OneHotEncoder

    # generate the data
    data = pd.DataFrame({
        'y':  [1, 2, np.nan, 4, 5],
        'x1': [6, 7, 8, np.nan, np.nan],
        'x2': [9, 10, 11, np.nan, np.nan],
        'x3': ['a', 'b', 'c', np.nan, np.nan],
        'x4': [np.nan, np.nan, 'd', 'e', 'f']
    })

    # extract the features and target
    x = data.drop(labels=['y'], axis=1)
    y = data[['y']]  # note that this is a data frame, not a series

    # split the data
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.5, random_state=99)

    # map the features to the corresponding types (numerical or categorical)
    numerical_features = x_train.select_dtypes(include=['int64', 'float64']).columns.tolist()
    categorical_features = x_train.select_dtypes(include=['object']).columns.tolist()

    # define the features pipeline
    numerical_features_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler())
    ])
    categorical_features_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
        ('onehot', OneHotEncoder(handle_unknown='ignore'))
    ])
    features_pipeline = ColumnTransformer(transformers=[
        ('num_features', numerical_features_transformer, numerical_features),
        ('cat_features', categorical_features_transformer, categorical_features)
    ])

    # define the target pipeline
    target_pipeline = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='mean')),
        ('scaler', StandardScaler())
    ])

    # fit the pipelines to the training data
    features_pipeline.fit(x_train)
    target_pipeline.fit(y_train)

    # apply the pipelines to the training and test data
    x_train_pipe = features_pipeline.transform(x_train)
    x_test_pipe = features_pipeline.transform(x_test)
    y_train_pipe = target_pipeline.transform(y_train)
    y_test_pipe = target_pipeline.transform(y_test)

    x_train = pd.DataFrame(data=x_train_pipe)
    x_test = pd.DataFrame(data=x_test_pipe)
    y_train = pd.DataFrame(data=y_train_pipe)
    y_test = pd.DataFrame(data=y_test_pipe)
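
For context on the traceback in the question: `n_features = X.shape[1]` fails because `cleaned_data['price']` is a pandas Series, whose shape tuple has only one entry, while scikit-learn transformers expect 2-D input. The sketch below illustrates this with a tiny made-up `price` column (the values are illustrative assumptions, not the real dataset). It also shows `inverse_transform`, which maps scaled targets or predictions back to the original units.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy target column; the values are made up for illustration.
df = pd.DataFrame({'price': [326.0, 327.0, 334.0, 335.0]})

y_series = df['price']    # single brackets: 1-D Series, shape (4,)
y_frame = df[['price']]   # double brackets: 2-D DataFrame, shape (4, 1)

# Fitting on the 2-D frame works, and the scaling can be undone later.
scaler = StandardScaler()
y_scaled = scaler.fit_transform(y_frame)
y_restored = scaler.inverse_transform(y_scaled)

# Fitting on the 1-D Series is rejected because transformers expect 2-D input.
try:
    StandardScaler().fit(y_series)
    series_fit_failed = False
except ValueError:
    series_fit_failed = True

print(y_series.shape, y_frame.shape, series_fit_failed)
```

Selecting the column with double brackets, as the answer does with `data[['y']]`, is what keeps the target two-dimensional.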

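As an alternative that is not part of the answer above, scikit-learn also provides `TransformedTargetRegressor`, which applies a transformer to the target during `fit` and inverts it automatically during `predict`. This is only a minimal sketch: the synthetic data and the choice of `LinearRegression` are assumptions made for illustration, and any regressor could take its place.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration): y is an exact linear
# function of two features, on a price-like scale.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 1000.0 + 300.0 * X[:, 0] - 200.0 * X[:, 1]

# The regressor is trained on the standardized target; predict() returns
# values that have already been mapped back to the original scale.
model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    transformer=StandardScaler(),
)
model.fit(X, y)
pred = model.predict(X)

print(pred[:3])
print(y[:3])
```

With this wrapper the target can stay a 1-D array, and there is no need to call `inverse_transform` by hand when reading predictions.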