I am trying to convert a column of strings in a Pandas DataFrame into its one-hot encoded equivalent using Scikit-Learn's OneHotEncoder. My code is below, but it does not work:
    from sklearn.preprocessing import OneHotEncoder

    # data is a Pandas DataFrame
    jobs_encoder = OneHotEncoder()
    jobs_encoder.fit(data['Profession'].unique().reshape(1, -1))
    data['Profession'] = jobs_encoder.transform(data['Profession'].to_numpy().reshape(-1, 1))
It produces the following error (the strings in the list have been omitted):
    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    <ipython-input-91-3a1f568322f5> in <module>()
          3 jobs_encoder = OneHotEncoder()
          4 jobs_encoder.fit(data['Profession'].unique().reshape(1, -1))
    ----> 5 data['Profession'] = jobs_encoder.transform(data['Profession'].to_numpy().reshape(-1, 1))

    /usr/local/anaconda3/envs/ml/lib/python3.6/site-packages/sklearn/preprocessing/_encoders.py in transform(self, X)
        730                                        copy=True)
        731         else:
    --> 732             return self._transform_new(X)
        733
        734     def inverse_transform(self, X):

    /usr/local/anaconda3/envs/ml/lib/python3.6/site-packages/sklearn/preprocessing/_encoders.py in _transform_new(self, X)
        678         """New implementation assuming categorical input"""
        679         # validation of X happens in _check_X called by _transform
    --> 680         X_int, X_mask = self._transform(X, handle_unknown=self.handle_unknown)
        681
        682         n_samples, n_features = X_int.shape

    /usr/local/anaconda3/envs/ml/lib/python3.6/site-packages/sklearn/preprocessing/_encoders.py in _transform(self, X, handle_unknown)
        120                     msg = ("Found unknown categories {0} in column {1}"
        121                            " during transform".format(diff, i))
    --> 122                     raise ValueError(msg)
        123                 else:
        124                     # Set the problematic rows to an acceptable value and

    ValueError: Found unknown categories ['...', ..., '...'] in column 0 during transform
Here is some sample data:
    data['Profession'] =

    0         unkn
    1         safe
    2         rece
    3         unkn
    4         lead
              ...
    111988    indu
    111989    seni
    111990    mess
    111991    seni
    111992    proj
    Name: Profession, Length: 111993, dtype: object
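What exactly am I doing wrong?

Answer:

It turned out that Scikit-Learn's LabelBinarizer worked better for converting the data to one-hot encoded format. With the help of Amnie's solution, my final code is as follows: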
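    import pandas as pd
    from sklearn.preprocessing import LabelBinarizer

    # Learn the set of distinct professions from the 1-D column
    jobs_encoder = LabelBinarizer()
    jobs_encoder.fit(data['Profession'])

    # One row per sample, one 0/1 column per profession
    transformed = jobs_encoder.transform(data['Profession'])
    ohe_df = pd.DataFrame(transformed)

    # Append the encoded columns and drop the original string column
    data = pd.concat([data, ohe_df], axis=1).drop(['Profession'], axis=1)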
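
For anyone who would rather keep OneHotEncoder: the original error appears to come from the shape passed to `fit`. `reshape(1, -1)` turns the unique values into a single row with many columns, so the encoder treats each value as its own feature with exactly one known category. When `transform` then receives a single column, every profession other than the first one counts as unknown. Below is a minimal sketch of fitting on a column-shaped input instead. The five-row DataFrame is a hypothetical stand-in built from the sample values in the question, and names such as `wrong_encoder` and `n_features_wrong` are only illustrative.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for the real DataFrame, built from the sample values above.
data = pd.DataFrame({'Profession': ['unkn', 'safe', 'rece', 'unkn', 'lead']})

# reshape(1, -1) yields one row with four columns, so the encoder learns
# four separate features that each have a single category.
wrong_encoder = OneHotEncoder().fit(data['Profession'].unique().reshape(1, -1))
n_features_wrong = len(wrong_encoder.categories_)

# Double brackets select a one-column DataFrame of shape (n_samples, 1):
# one feature whose categories are all the distinct professions.
jobs_encoder = OneHotEncoder()
encoded = jobs_encoder.fit_transform(data[['Profession']])  # sparse by default
n_features_right = len(jobs_encoder.categories_)

# Densify the result and label each column with the category it represents.
ohe_df = pd.DataFrame(
    encoded.toarray(),
    columns=jobs_encoder.categories_[0],
    index=data.index,
)
data = pd.concat([data.drop(columns='Profession'), ohe_df], axis=1)
```

Note that the encoded result is several columns wide, so it cannot be assigned back to the single `data['Profession']` column as in the question; it has to be joined onto the DataFrame as new columns. Reusing `data.index` for `ohe_df` keeps the rows aligned during `pd.concat` even if the DataFrame does not have a default integer index.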