Can sklearn's LabelBinarizer work like DictVectorizer?

I have a dataset with both numeric and categorical features, where a categorical feature can hold a list of labels. For example:

    RecipeId   Ingredients    TimeToPrep
    1          Flour, Milk    20
    2          Milk           5
    3          Unobtainium    100

If every recipe had only one ingredient, DictVectorizer would handle the encoding into appropriate dummy variables elegantly:

    from sklearn.feature_extraction import DictVectorizer

    RecipeData = [{'RecipeID': 1, 'Ingredients': 'Flour', 'TimeToPrep': 20},
                  {'RecipeID': 2, 'Ingredients': 'Milk', 'TimeToPrep': 5},
                  {'RecipeID': 3, 'Ingredients': 'Unobtainium', 'TimeToPrep': 100}]
    dc = DictVectorizer()
    dc.fit_transform(RecipeData).toarray()

This gives the output:

    array([[   1.,    0.,    0.,    1.,   20.],
           [   0.,    1.,    0.,    2.,    5.],
           [   0.,    0.,    1.,    3.,  100.]])

The integer features are handled correctly, and the categorical labels are encoded as boolean features.

However, DictVectorizer cannot handle list-valued features, and it chokes on data like this:

    RecipeData = [{'RecipeID': 1, 'Ingredients': ['Flour', 'Milk'], 'TimeToPrep': 20},
                  {'RecipeID': 2, 'Ingredients': 'Milk', 'TimeToPrep': 5},
                  {'RecipeID': 3, 'Ingredients': 'Unobtainium', 'TimeToPrep': 100}]

LabelBinarizer handles this case correctly, but the categorical variables have to be extracted and processed separately:

    from sklearn.preprocessing import LabelBinarizer

    lb = LabelBinarizer()
    lb.fit_transform([('Flour', 'Milk'), ('Milk',), ('Unobtainium',)])

    array([[1, 1, 0],
           [0, 1, 0],
           [0, 0, 1]])

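That is what I currently do: pull the categorical features that contain lists of labels out of the mixed numeric/categorical input, transform them with LabelBinarizer, and then glue the numeric features back on.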
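For readers on a recent scikit-learn release, the extract, binarize, and recombine workflow described above might look roughly like the sketch below. It assumes `MultiLabelBinarizer` (the class that takes over list-of-labels input in newer versions, where `LabelBinarizer` no longer accepts it) and NumPy's `hstack`. The variable names are illustrative, and single ingredients are written as one-element lists so that every record has the same shape.

```python
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer

# Same recipes as in the question; single ingredients are wrapped in lists
# here (an assumption of this sketch) so every record looks the same.
recipes = [
    {"RecipeID": 1, "Ingredients": ["Flour", "Milk"], "TimeToPrep": 20},
    {"RecipeID": 2, "Ingredients": ["Milk"], "TimeToPrep": 5},
    {"RecipeID": 3, "Ingredients": ["Unobtainium"], "TimeToPrep": 100},
]

# Step 1: extract the list-valued categorical feature.
ingredient_lists = [r["Ingredients"] for r in recipes]

# Step 2: turn each list of labels into a row of 0/1 indicators.
mlb = MultiLabelBinarizer()
ingredient_matrix = mlb.fit_transform(ingredient_lists)

# Step 3: glue the numeric features back on as extra columns.
numeric_matrix = np.array([[r["RecipeID"], r["TimeToPrep"]] for r in recipes])
features = np.hstack([ingredient_matrix, numeric_matrix])

# Column names, in the same order as the columns of `features`.
feature_names = list(mlb.classes_) + ["RecipeID", "TimeToPrep"]
```

The manual bookkeeping in steps 1 and 3 is the part the question is trying to avoid.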

Is there a more elegant way to do this?

Answer:

LabelBinarizer is intended for class labels, not features (although with the right massaging it will handle categorical features too).

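The intended use of DictVectorizer is that you map a data-specific function over your samples to extract useful features, and that function returns a dict. So the elegant way to solve this is to write a function that flattens your feature dicts and replaces each list with individual features whose value is True: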

    >>> def flatten_ingredients(d):
    ...     # in-place version
    ...     if isinstance(d.get('Ingredients'), list):
    ...         for ingredient in d.pop('Ingredients'):
    ...             d['Ingredients=%s' % ingredient] = True
    ...     return d
    ...
    >>> RecipeData = [{'RecipeID': 1, 'Ingredients': ['Flour', 'Milk'], 'TimeToPrep': 20},
    ...               {'RecipeID': 2, 'Ingredients': 'Milk', 'TimeToPrep': 5},
    ...               {'RecipeID': 3, 'Ingredients': 'Unobtainium', 'TimeToPrep': 100}]
    >>> list(map(flatten_ingredients, RecipeData))
    [{'RecipeID': 1, 'TimeToPrep': 20, 'Ingredients=Flour': True, 'Ingredients=Milk': True}, {'RecipeID': 2, 'Ingredients': 'Milk', 'TimeToPrep': 5}, {'RecipeID': 3, 'Ingredients': 'Unobtainium', 'TimeToPrep': 100}]

In action:

    >>> from sklearn.feature_extraction import DictVectorizer
    >>> dv = DictVectorizer()
    >>> dv.fit_transform(flatten_ingredients(d) for d in RecipeData).toarray()
    array([[   1.,    1.,    0.,    1.,   20.],
           [   0.,    1.,    0.,    2.,    5.],
           [   0.,    0.,    1.,    3.,  100.]])
    >>> dv.feature_names_
    ['Ingredients=Flour', 'Ingredients=Milk', 'Ingredients=Unobtainium', 'RecipeID', 'TimeToPrep']
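The helper above modifies each dict in place, which can be surprising if the raw records are reused later. A non-mutating variant might look like the following sketch. The name `flattened` and its `key` parameter are hypothetical, not part of scikit-learn. It also expands a plain string value into the same `Ingredients=<name>` key, which lines up with the column name DictVectorizer produces for string values with its default `=` separator.

```python
def flattened(record, key="Ingredients"):
    """Return a copy of `record` with the values under `key` expanded
    into one boolean feature per value. The input dict is not modified."""
    # Copy everything except the list-valued feature.
    out = {k: v for k, v in record.items() if k != key}
    value = record.get(key)
    if value is None:
        return out
    # Treat a single string the same way as a one-element list.
    values = [value] if isinstance(value, str) else value
    for item in values:
        out["%s=%s" % (key, item)] = True
    return out


recipe_data = [
    {"RecipeID": 1, "Ingredients": ["Flour", "Milk"], "TimeToPrep": 20},
    {"RecipeID": 2, "Ingredients": "Milk", "TimeToPrep": 5},
    {"RecipeID": 3, "Ingredients": "Unobtainium", "TimeToPrep": 100},
]

# New dicts, ready to be passed to DictVectorizer; recipe_data is untouched.
flat = [flattened(d) for d in recipe_data]
```

The resulting list of dicts can be passed to `DictVectorizer().fit_transform(...)` in the same way as the in-place version.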

(If I were you, I'd also remove RecipeID, since it is unlikely to be a useful feature and can easily lead to overfitting.)
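One way to drop the identifier before vectorizing is to filter it out of each record, as in this small sketch. The helper name `without_keys` is made up for illustration.

```python
def without_keys(record, drop=("RecipeID",)):
    """Return a copy of `record` without the keys listed in `drop`."""
    return {k: v for k, v in record.items() if k not in drop}


sample = {"RecipeID": 1, "Ingredients=Flour": True, "TimeToPrep": 20}

# The identifier is gone; the real features are kept.
cleaned = without_keys(sample)
```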
