我已经下载了这个数据集,以下是我的代码:
from sklearn.metrics import classification_report, confusion_matrixfrom sklearn.utils.multiclass import unique_labelsimport plotly.figure_factory as ffimport pandas as pdfrom sklearn.preprocessing import LabelEncoder, StandardScalerfrom sklearn.impute import SimpleImputerimport numpy as npfrom sklearn.impute import KNNImputerfrom sklearn.model_selection import train_test_splitfrom sklearn.svm import SVCfrom sklearn.tree import DecisionTreeClassifierfrom sklearn.pipeline import Pipelinefrom sklearn.preprocessing import OrdinalEncoder, OneHotEncoderfrom sklearn.compose import make_column_transformerrandom_state = 27912df_train = pd.read_csv("...")df_test = pd.read_csv("...")X_train, X_test, y_train, y_test = train_test_split(df_train.drop(["Survived", "Ticket", "Cabin", "Name", "PassengerId"], axis = 1), df_train["Survived"], test_size=0.2, random_state=42)numeric_col_names = ["Age", "SibSp", "Parch", "Fare"]ordinal_col_names = ["Pclass"]one_hot_col_names = ["Embarked", "Sex"]ct = make_column_transformer( (SimpleImputer(strategy="median"), numeric_col_names), (SimpleImputer(strategy="most_frequent"), ordinal_col_names + one_hot_col_names), (OrdinalEncoder(), ordinal_col_names), (OneHotEncoder(), one_hot_col_names), (StandardScaler(), ordinal_col_names + one_hot_col_names + numeric_col_names))preprocessing_pipeline = Pipeline([("transformers", ct)])preprocessing_pipeline.fit_transform(X_train)
我在尝试创建一个用于预处理步骤的column_transformer
,然而,OneHotEncoding
步骤出现了错误,ValueError: Input contains NaN
。我不知道为什么会这样,因为我之前已经填补了值。有什么线索可以解释为什么会发生这种情况吗?
尝试像这样的解决方案也没有帮助
preprocessing_pipeline = Pipeline([("transformers", ct_first)])ct_second = make_column_transformer((OneHotEncoder(), one_hot_col_names),(StandardScaler(), ordinal_col_names + one_hot_col_names + numeric_col_names))pipeline = Pipeline([("transformer1", preprocessing_pipeline), ("transformer2", ct_second)])pipeline.fit_transform(X_train)
我想知道为什么会发生这种情况,以及为什么上述代码的第一次和第二次尝试是不正确的。谢谢
回答:
您需要为每种类型的列创建一个管道,以确保不同的步骤是按顺序应用的(即确保在编码和缩放之前先填补缺失值),另请参阅scikit-learn文档中的这个示例。