I have a preprocessing script that loads the diamonds dataset and preprocesses it. Obviously, I also need it to preprocess the labels.
Here is my code:
    # Data Preprocessing
    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.impute import SimpleImputer
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler, OneHotEncoder
    from icecream import ic


    def diamond_preprocess(data_dir):
        data = pd.read_csv(data_dir)
        cleaned_data = data.drop(['id', 'depth_percent'], axis=1)  # Features I don't want

        x = cleaned_data.drop(['price'], axis=1)  # Train data
        y = cleaned_data['price']  # Label data

        x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=99)

        numerical_features = x_train.select_dtypes(include=['int64', 'float64']).columns.tolist()
        categorical_features = x_train.select_dtypes(include=['object']).columns.tolist()

        numerical_transformer = Pipeline(steps=[
            ('imputer', SimpleImputer(strategy='median')),  # Fill in missing data with median
            ('scaler', StandardScaler())  # Scale data
        ])

        categorical_transformer = Pipeline(steps=[
            ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),  # Fill in missing data with 'missing'
            ('onehot', OneHotEncoder(handle_unknown='ignore'))  # One hot encode categorical data
        ])

        preprocessor_pipeline = ColumnTransformer(
            transformers=[
                ('num', numerical_transformer, numerical_features),
                ('cat', categorical_transformer, categorical_features)
            ])

        # Fit to the training data
        preprocessor_pipeline.fit(x_train)
        preprocessor_pipeline.fit(y_train)

        # Apply the pipeline to the training and test data
        x_train_pipe = preprocessor_pipeline.transform(x_train)
        x_test_pipe = preprocessor_pipeline.transform(x_test)
        y_train_pipe = preprocessor_pipeline.transform(y_train)
        y_test_pipe = preprocessor_pipeline.transform(y_test)

        x_train = pd.DataFrame(data=x_train_pipe)
        x_test = pd.DataFrame(data=x_test_pipe)
        y_train = pd.DataFrame(data=y_train_pipe)
        y_test = pd.DataFrame(data=y_test_pipe)

        return x_train, x_test, y_train, y_test
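I'm not confident that my code is correct, or that I understand pipelines and preprocessing in sklearn well enough. Evidently the interpreter agrees with me, because I get this error: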
      File "C:\Users\17574\Anaconda3\envs\kraken-gpu\lib\site-packages\sklearn\compose\_column_transformer.py", line 470, in fit
        self.fit_transform(X, y=y)
      File "C:\Users\17574\Anaconda3\envs\kraken-gpu\lib\site-packages\sklearn\compose\_column_transformer.py", line 502, in fit_transform
        self._check_n_features(X, reset=True)
      File "C:\Users\17574\Anaconda3\envs\kraken-gpu\lib\site-packages\sklearn\base.py", line 352, in _check_n_features
        n_features = X.shape[1]
    IndexError: tuple index out of range
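How do I correctly preprocess my labels the same way I preprocess the training data? An explanation would be appreciated!
Answer:
The error comes from calling the feature preprocessor on y_train: y_train is a one-dimensional Series, while ColumnTransformer expects two-dimensional input and was configured with the feature column names, none of which exist in the target. If you want to apply the transformations separately, you can create an additional pipeline for the target column; see the example below.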
    import pandas as pd
    import numpy as np
    from sklearn.compose import ColumnTransformer
    from sklearn.impute import SimpleImputer
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler, OneHotEncoder

    # generate the data
    data = pd.DataFrame({
        'y': [1, 2, np.nan, 4, 5],
        'x1': [6, 7, 8, np.nan, np.nan],
        'x2': [9, 10, 11, np.nan, np.nan],
        'x3': ['a', 'b', 'c', np.nan, np.nan],
        'x4': [np.nan, np.nan, 'd', 'e', 'f']
    })

    # extract the features and target
    x = data.drop(labels=['y'], axis=1)
    y = data[['y']]  # note that this is a data frame, not a series

    # split the data
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.5, random_state=99)

    # map the features to the corresponding types (numerical or categorical)
    numerical_features = x_train.select_dtypes(include=['int64', 'float64']).columns.tolist()
    categorical_features = x_train.select_dtypes(include=['object']).columns.tolist()

    # define the features pipeline
    numerical_features_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler())
    ])

    categorical_features_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
        ('onehot', OneHotEncoder(handle_unknown='ignore'))
    ])

    features_pipeline = ColumnTransformer(transformers=[
        ('num_features', numerical_features_transformer, numerical_features),
        ('cat_features', categorical_features_transformer, categorical_features)
    ])

    # define the target pipeline
    target_pipeline = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='mean')),
        ('scaler', StandardScaler())
    ])

    # fit the pipelines to the training data
    features_pipeline.fit(x_train)
    target_pipeline.fit(y_train)

    # apply the pipelines to the training and test data
    x_train_pipe = features_pipeline.transform(x_train)
    x_test_pipe = features_pipeline.transform(x_test)
    y_train_pipe = target_pipeline.transform(y_train)
    y_test_pipe = target_pipeline.transform(y_test)

    x_train = pd.DataFrame(data=x_train_pipe)
    x_test = pd.DataFrame(data=x_test_pipe)
    y_train = pd.DataFrame(data=y_train_pipe)
    y_test = pd.DataFrame(data=y_test_pipe)
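As a rough illustration of why the target has to be two-dimensional, and of how a scaled target might be mapped back to its original units, here is a minimal sketch. The price values and the names `y_series`, `y_frame`, and `target_scaler` are made up for this example and are not taken from the diamonds dataset or the code above.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy target standing in for the 'price' column.
y_series = pd.Series([326.0, 334.0, 423.0, 554.0], name="price")

# A Series is one-dimensional: its shape tuple has a single entry,
# so reading shape[1] (as the traceback shows sklearn doing) fails.
print(y_series.shape)

# A one-column DataFrame is two-dimensional, which transformers accept.
y_frame = y_series.to_frame()
print(y_frame.shape)

# Fit the scaler on the 2-D target and transform it.
target_scaler = StandardScaler()
y_scaled = target_scaler.fit_transform(y_frame)

# Values on the scaled target can be converted back to the original units.
y_restored = target_scaler.inverse_transform(y_scaled)
```

Keeping the fitted target transformer around matters if a model is trained on the scaled target, since its predictions would then need `inverse_transform` to be read as prices again.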