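I am learning machine learning with Kaggle's Titanic dataset. I use sklearn's LabelEncoder to convert text data into numeric labels. The following code works for the "Sex" column but fails for the "Embarked" column.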
from sklearn import preprocessing

encoder = preprocessing.LabelEncoder()
features["Sex"] = encoder.fit_transform(features["Sex"])
features["Embarked"] = encoder.fit_transform(features["Embarked"])
This is the error message I get:
Traceback (most recent call last):
  File "../src/script.py", line 20, in <module>
    features["Embarked"] = encoder.fit_transform(features["Embarked"])
  File "/opt/conda/lib/python3.6/site-packages/sklearn/preprocessing/label.py", line 131, in fit_transform
    self.classes_, y = np.unique(y, return_inverse=True)
  File "/opt/conda/lib/python3.6/site-packages/numpy/lib/arraysetops.py", line 211, in unique
    perm = ar.argsort(kind='mergesort' if return_index else 'quicksort')
TypeError: '>' not supported between instances of 'str' and 'float'
Answer:
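I solved this myself. The problem is that the "Embarked" column contains NaN values. NaN is a float, so the column mixes str and float, and the sort inside np.unique cannot compare them. Replacing NaN with a numeric value still raises the error, because the column would still hold two different types. So I replaced NaN with a string value instead: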
features["Embarked"] = encoder.fit_transform(features["Embarked"].fillna('0'))
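To see why a string fill value helps, here is a minimal pure-Python sketch of the same failure and fix. The `encode_labels` helper and the sample `embarked` list are hypothetical and only for illustration. This is not sklearn's implementation; it just mimics the "sort the unique values, then number them" idea visible in the traceback.

```python
import math


def encode_labels(values, fill="0"):
    """Map each distinct label to an integer code, in sorted order.

    Hypothetical stand-in for a label encoder, not sklearn's code.
    Missing values (None or float NaN) are replaced by the string `fill`
    first, so every item is a str and the sort below cannot fail.
    """
    cleaned = [
        fill if v is None or (isinstance(v, float) and math.isnan(v)) else v
        for v in values
    ]
    classes = sorted(set(cleaned))  # sorting needs mutually comparable items
    index = {label: code for code, label in enumerate(classes)}
    return [index[v] for v in cleaned], classes


# Made-up sample resembling the "Embarked" column, with one missing value.
embarked = ["S", "C", float("nan"), "Q", "S"]

# Sorting str and float together raises TypeError, as in the traceback.
try:
    sorted(embarked)
    mixed_sort_failed = False
except TypeError:
    mixed_sort_failed = True

codes, classes = encode_labels(embarked)
print(mixed_sort_failed, classes, codes)
```

Note that the fill value becomes a class of its own, so a more descriptive placeholder than '0' (for example 'Unknown') may make the encoded labels easier to read later.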