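I'm trying to build a machine learning model to predict who survived on the Titanic. Whenever I try to fit my model, I get this error:
coordinates = np.where(mask.transpose())[::-1]
AttributeError: 'bool' object has no attribute 'transpose'
The code I'm running is below: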
from xgboost import XGBClassifier
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import SelectFromModel
from itertools import combinations
import pandas as pd
import numpy as np

# read in data
training_data = pd.read_csv('train.csv')
testing_data = pd.read_csv('test.csv')

# separate X and Y
X_train_full = training_data.copy()
y = X_train_full.Survived
X_train_full.drop(['Survived'], axis=1, inplace=True)
y_test = testing_data

# get all str columns
cat_columns1 = [cname for cname in X_train_full.columns if
                X_train_full[cname].dtype == "object"]

interactions = pd.DataFrame(index= X_train_full)

# create new features
for combination in combinations(cat_columns1,2):
    imputer = SimpleImputer(strategy='constant')
    new_col_name = '_'.join(combination)
    col1 = X_train_full[combination[0]]
    col2 = X_train_full[combination[1]]
    col1 = np.array(col1).reshape(-1,1)
    col2 = np.array(col2).reshape(-1,1)
    col1 = imputer.fit_transform(col1)
    col2 = imputer.fit_transform(col2)
    new_vals = col1 + '_' + col2
    OneHot = OneHotEncoder()
    interactions[new_col_name] = OneHot.fit_transform(new_vals)

interactions = interactions.reset_index(drop = True)

# create new dataframe with new features included
new_df = X_train_full.join(interactions)

# do the same for the test file
interactions2 = pd.DataFrame(index= y_test)
for combination in combinations(cat_columns1,2):
    imputer = SimpleImputer(strategy='constant')
    new_col_name = '_'.join(combination)
    col1 = y_test[combination[0]]
    col2 = y_test[combination[1]]
    col1 = np.array(col1).reshape(-1,1)
    col2 = np.array(col2).reshape(-1,1)
    col1 = imputer.fit_transform(col1)
    col2 = imputer.fit_transform(col2)
    new_vals = col1 + '_' + col2
    OneHot = OneHotEncoder()
    interactions2[new_col_name] = OneHot.fit_transform(new_vals)
    interactions2[new_col_name] = new_vals

interactions2 = interactions2.reset_index(drop = True)
y_test = y_test.join(interactions2)

# get names of cat columns (with new features added)
cat_columns = [cname for cname in new_df.columns if
               new_df[cname].dtype == "object"]

# Select numerical columns
num_columns = [cname for cname in new_df.columns if
               new_df[cname].dtype in ['int64', 'float64']]

# set up pipeline
numerical_transformer = SimpleImputer(strategy = 'constant')

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, num_columns),
        ('cat', categorical_transformer, cat_columns)
    ])

model = XGBClassifier()

my_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', model)
                             ])

# fit model
my_pipeline.fit(new_df,y)
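The CSV files I'm reading are available on Kaggle at the following link:
https://www.kaggle.com/c/titanic/data
I can't work out what exactly is causing this. Any help would be much appreciated.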
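Answer:
This is probably because your data contains pd.NA values. pd.NA was introduced in pandas 1.0.0 and, at the time of writing, is still marked as experimental.
SimpleImputer ends up evaluating data == np.nan, which normally returns a NumPy array. When data contains pd.NA values, however, the comparison returns a single boolean scalar instead. That scalar appears to be the mask in your traceback, and a bool has no transpose method, hence the AttributeError.
An example: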
import pandas as pd
import numpy as np

test_pd_na = pd.DataFrame({"A": [1, 2, 3, pd.NA]})
test_np_nan = pd.DataFrame({"A": [1, 2, 3, np.nan]})

test_np_nan.to_numpy() == np.nan
> array([[False], [False], [False], [False]])

test_pd_na.to_numpy() == np.nan
> False
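The fix is to convert every pd.NA value to np.nan before running SimpleImputer. You can do that by calling .replace({pd.NA: np.nan}) on the DataFrame. The obvious downside is that you lose the benefits of pd.NA, such as integer columns that can hold missing data without being cast to float columns.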
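As a rough illustration of that fix, here is a minimal sketch. The toy DataFrame and its column values are made up for the example (they only imitate Titanic-style columns), and the variable names cleaned, imputed, nullable_ints and float_ints are hypothetical. It assumes pandas, NumPy and scikit-learn are installed.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame standing in for the real data: object columns
# whose missing entries are pd.NA rather than np.nan.
df = pd.DataFrame(
    {"Cabin": ["C85", pd.NA, "E46"], "Embarked": ["S", "C", pd.NA]},
    dtype=object,
)

# Swap every pd.NA for np.nan before the data reaches SimpleImputer.
cleaned = df.replace({pd.NA: np.nan})

# With only np.nan left, the constant-fill imputer from the question can run.
# For string/object data its default fill value is "missing_value".
imputer = SimpleImputer(strategy="constant")
imputed = imputer.fit_transform(cleaned)
print(imputed)

# The trade-off mentioned above: a nullable integer column keeps an integer
# dtype, while the np.nan version of the same data is stored as float64.
nullable_ints = pd.Series([1, 2, pd.NA], dtype="Int64")
float_ints = pd.Series([1, 2, np.nan])
print(nullable_ints.dtype, float_ints.dtype)
```

In a pipeline like the one in the question, the same replace call would go on new_df (and on the test frame) before my_pipeline.fit is called.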