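What is the best way to encode non-hierarchical categorical features in machine learning?

For string features where order does not matter, is it better to use get_dummies or OneHotEncoder?

For example, in this pandas DataFrame: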

    import pandas as pd

    df_with_cat = pd.DataFrame({'A': ['ios', 'android', 'web', 'NaN', 'ios', 'ios', 'NaN', 'android'],
                                'B': [4, 4, 'NaN', 2, 'NaN', 3, 3, 'NaN']})
    df_with_cat.head(10)

        A        B
    ---------------
    0   ios      4
    1   android  4
    2   web      NaN
    3   NaN      2
    4   ios      NaN
    5   ios      3
    6   NaN      3
    7   android  NaN

I know that in order to process these columns (impute missing values and so on), I first have to encode them, like this:

    from sklearn.preprocessing import LabelEncoder

    df_with_cat_orig = df_with_cat.copy()
    la_encoder = LabelEncoder()
    df_with_cat['A'] = la_encoder.fit_transform(df_with_cat.A)

Output:

    df_with_cat.head(10)

        A   B
    -----------
    0   2   4
    1   1   4
    2   3   NaN
    3   0   2
    4   2   NaN
    5   2   3
    6   0   3
    7   1   NaN

But now it looks as if there is some kind of order from 0 to 3, which is not the case: 'ios' -> 2 is not necessarily greater than 'android' -> 1.

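Answer:

I just got the answer to my question above (it concerns the part of the question quoted below):

When you encode the categories as numbers and keep them all in a single feature, the model assumes that the order is meaningful, so it treats 'ios' (mapped to 2) as greater than 'android' (mapped to 1).

"But now it looks as if there is some kind of order from 0 to 3, which is not the case: 'ios' -> 2 is not necessarily greater than 'android' -> 1."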
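To see where the numbers come from, here is a minimal sketch that inspects the mapping LabelEncoder builds for the values of column A. The variable names (`platforms`, `mapping`) are only illustrative. LabelEncoder sorts the distinct labels and numbers them in that order, so the codes reflect alphabetical position and nothing about the platforms themselves.

```python
from sklearn.preprocessing import LabelEncoder

# Same values as column A in the question ('NaN' is a plain string here).
platforms = ['ios', 'android', 'web', 'NaN', 'ios', 'ios', 'NaN', 'android']

encoder = LabelEncoder()
codes = encoder.fit_transform(platforms)

# classes_ holds the sorted distinct labels; the position of each label is its code.
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)  # {'NaN': 0, 'android': 1, 'ios': 2, 'web': 3}
```

Note that the placeholder string 'NaN' also receives a code (0) like any other category, which is why it shows up as 0 in the encoded column.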

If a given feature does not have many categories, this is easy with get_dummies:

    data_with_dummies = pd.get_dummies(df_with_cat, columns=['A'], drop_first=True)

        B    A_1  A_2  A_3
    ------------------------
    0   4    0    1    0
    1   4    1    0    0
    2   NaN  0    0    1
    3   2    0    0    0
    4   NaN  0    1    0
    5   3    0    1    0
    6   3    0    0    0
    7   NaN  1    0    0

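This avoids the problem described at the start, because the model no longer sees a spurious order between the categories, and that should noticeably improve its performance.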
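As a variation on the above, get_dummies can also be applied to the string column directly, without the LabelEncoder step, which gives readable column names. The sketch below makes two assumptions that are not part of the original answer: the placeholder string 'NaN' is first converted to a real missing value, and `dtype=int` is passed so the result is 0/1 rather than booleans.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['ios', 'android', 'web', 'NaN', 'ios', 'ios', 'NaN', 'android']})

# Assumption: treat the literal string 'NaN' as a genuinely missing value.
df['A'] = df['A'].replace('NaN', np.nan)

# One indicator column per category; rows with a missing value get all zeros.
dummies = pd.get_dummies(df, columns=['A'], dtype=int)
print(dummies)
```

Here the missing rows are simply all zeros. Passing `dummy_na=True` would add a separate indicator column for them instead.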

Alternatively, use OneHotEncoder directly, as @Primusa said in the answer above.
