I am trying to use CategoricalImputer from sklearn_pandas to fill in NaN values in categorical columns.
    from sklearn_pandas import CategoricalImputer

    imputer = CategoricalImputer()
    nan_columns = train_df.loc[:, train_df.isnull().any()]
    for column in nan_columns:
        imputer.fit_transform(column)
However, imputer.fit_transform(column) raises the following error:

    AttributeError: 'str' object has no attribute 'copy'

I followed the documentation. What am I doing wrong?
Edit:
I added this cell:
    from sklearn.impute import SimpleImputer

    nan_columns = train_df.loc[:, train_df.isnull().any()]
    imputer = SimpleImputer(strategy="most_frequent")
    imputer.fit_transform(train_df)
    msno.bar(train_df.sample(1000), labels=True, fontsize=8)
However, this did not work. Here is the bar chart, which shows that some columns still have missing values:
Answer:
You can use scikit-learn's SimpleImputer for categorical values; just pass strategy="most_frequent".
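    import numpy as np
    import pandas as pd
    from sklearn.impute import SimpleImputer

    imp = SimpleImputer(strategy="most_frequent")
    df = pd.DataFrame({"x": ["a", "a", np.nan],
                       "y": ["c", np.nan, "c"],
                       "z": ["a", np.nan, np.nan]})
    print(df)
    # fit_transform returns a new array, so assign it back into the frame
    df[:] = imp.fit_transform(df)
    print(df)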
Output:
         x    y    z
    0    a    c    a
    1    a  NaN  NaN
    2  NaN    c  NaN
       x  y  z
    0  a  c  a
    1  a  c  a
    2  a  c  a
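The assignment back into the frame matters: fit_transform returns a new NumPy array and does not modify its input in place. That is probably why the cell in the question's edit still shows missing values, since its return value is discarded. Here is a minimal sketch on the same toy frame; the names `filled`, `df_filled`, `missing_before` and `missing_after` are only illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"x": ["a", "a", np.nan],
                   "y": ["c", np.nan, "c"],
                   "z": ["a", np.nan, np.nan]})
imp = SimpleImputer(strategy="most_frequent")

# fit_transform returns a new ndarray; df itself is left untouched
filled = imp.fit_transform(df)
missing_before = int(df.isnull().sum().sum())

# Wrap the result back into a DataFrame with the original labels
df_filled = pd.DataFrame(filled, columns=df.columns, index=df.index)
missing_after = int(df_filled.isnull().sum().sum())

print(missing_before, missing_after)
```

In the question's cell, that would mean capturing the result of imputer.fit_transform(train_df) and writing it back to train_df before plotting.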
If you only want to apply it to string or categorical columns:
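    for col, tp in df.dtypes.items():
        if tp == object or tp.name == "category":
            # SimpleImputer expects 2-D input, hence df[[col]];
            # ravel() flattens the (n, 1) result back to one column
            df[col] = imp.fit_transform(df[[col]]).ravel()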
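As for the AttributeError in the question, a likely cause is that iterating over a DataFrame yields its column labels rather than the columns themselves, so the loop hands a plain str to the imputer. This is inferred from the error message, not checked against the asker's data. A small pandas-only sketch, using a hypothetical toy `train_df`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the asker's train_df
train_df = pd.DataFrame({"color": ["red", np.nan, "red"],
                         "size": ["S", "M", np.nan],
                         "price": [1.0, 2.0, 3.0]})

nan_columns = train_df.loc[:, train_df.isnull().any()]

# Iterating over a DataFrame yields column labels (here: strings)
labels = [column for column in nan_columns]
print(labels)

# To get the data, index the frame with each label
series_list = [nan_columns[column] for column in nan_columns]
print(type(series_list[0]))
```

Passing nan_columns[column] (or train_df[[column]] for the 2-D input that SimpleImputer expects) gives the imputer actual data instead of a label.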