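I have some pandas.Series data, called s below, that I want to one-hot encode. From my research, I found that the 'b' level is not important for my predictive modeling task. I can exclude it from my analysis like this: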
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
s = pd.Series(['a', 'b', 'c']).values.reshape(-1, 1)
enc = OneHotEncoder(drop=['b'], sparse=False, handle_unknown='error')
enc.fit_transform(s)
# array([[1., 0.],
# [0., 0.],
# [0., 1.]])
enc.get_feature_names()
# array(['x0_a', 'x0_c'], dtype=object)
But when I try to transform a new series that contains both 'b' and a new level 'd', I get an error:
new_s = pd.Series(['a', 'b', 'c', 'd']).values.reshape(-1, 1)
enc.transform(new_s)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/user/Documents/assets/envs/data-science/venv/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py", line 390, in transform
    X_int, X_mask = self._transform(X, handle_unknown=self.handle_unknown)
  File "/Users/user/Documents/assets/envs/data-science/venv/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py", line 124, in _transform
    raise ValueError(msg)
ValueError: Found unknown categories ['d'] in column 0 during transform
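This is expected, since I set handle_unknown='error' earlier. However, I would like every category other than ['a', 'c'] to be ignored completely, both when fitting and in subsequent transform steps. I tried this: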
enc = OneHotEncoder(drop=['b'], sparse=False, handle_unknown='ignore')
enc.fit_transform(s)
enc.transform(new_s)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/user/Documents/assets/envs/data-science/venv/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py", line 371, in fit_transform
    self._validate_keywords()
  File "/Users/user/Documents/assets/envs/data-science/venv/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py", line 289, in _validate_keywords
    "`handle_unknown` must be 'error' when the drop parameter is "
ValueError: `handle_unknown` must be 'error' when the drop parameter is specified, as both would create categories that are all zero.
It looks like this pattern is not supported in scikit-learn. Does anyone know a scikit-learn-compatible way to accomplish this?
Answer:
You could also try the following:
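import numpy as np
from sklearn.preprocessing import OneHotEncoder

class IgnorantOneHotEncoder(OneHotEncoder):
    # Written for a single input column and dense output:
    # it only consults categories_[0] and only resets column 0.
    def transform(self, X, y=None):
        try:
            return super().transform(X)
        except ValueError as e:
            if 'Found unknown categories' in str(e):
                X = np.copy(X)
                # Track the rows that hold unknown categories
                unknown_categories_mask = ~np.isin(X, self.categories_[0]).ravel()
                # Overwrite the unknown categories in X with the first known category
                X[unknown_categories_mask] = self.categories_[0][0]
                # Every category is now known, so transform X
                X = super().transform(X)
                # Reset those rows to 0: the feature takes none of the known categories
                X[unknown_categories_mask, 0] = 0
                return X
            else:
                raise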
Try it out:
>>> ienc = IgnorantOneHotEncoder(sparse=False)
>>> ienc.fit(s)
IgnorantOneHotEncoder(sparse=False)
>>> ienc.transform(s)
array([[1., 0., 0.],
[0., 1., 0.],
[0., 0., 1.]])
>>> ienc.transform(new_s)
array([[1., 0., 0.],
[0., 1., 0.],
[0., 0., 1.],
[0., 0., 0.]])
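If the levels to keep are known up front, another option that may be worth trying is to skip drop entirely and pass the wanted levels through the categories parameter together with handle_unknown='ignore'. The sketch below assumes exactly that: the list ['a', 'c'] is fixed by hand rather than learned from the data, and the two arrays are stand-ins for s and new_s from the question, built directly with NumPy. It calls .toarray() on the default sparse output rather than passing sparse=False, since the name of that argument differs between scikit-learn versions.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Stand-ins for the question's `s` and `new_s`, built directly as object arrays.
s = np.array([['a'], ['b'], ['c']], dtype=object)
new_s = np.array([['a'], ['b'], ['c'], ['d']], dtype=object)

# List only the levels to keep (an assumption: they are known in advance).
# Anything else, seen during fit or not, is treated as unknown and ignored.
enc = OneHotEncoder(categories=[['a', 'c']], handle_unknown='ignore')
enc.fit(s)  # 'b' occurs in the training data but is not in `categories`

# The default output is a SciPy sparse matrix, so densify it for display.
encoded = enc.transform(new_s).toarray()
print(encoded)
```

With this setup the rows for 'b' and 'd' should both come out as all zeros, while 'a' and 'c' each get their own column. The trade-off is that a dropped level and a genuinely new level become indistinguishable after encoding, which is the ambiguity the ValueError in the question warns about.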