I have the following categorical data:
['Self employed', 'Government Dependent', 'Formally employed Private', 'Informally employed', 'Formally employed Government', 'Farming and Fishing', 'Remittance Dependent', 'Other Income', 'Dont Know/Refuse to answer', 'No Income']
How can I put them into bins such that:

['Government Dependent', 'Formally employed Government', 'Formally employed Private'] = 0
['Remittance Dependent', 'Informally employed', 'Self employed', 'Other Income'] = 1
['Dont Know/Refuse to answer', 'No Income', 'Farming and Fishing'] = 2

I already know how to put numeric data into categorical bins... can it be done the other way around?
    import pandas as pd

    TRAIN = pd.read_csv("Train_v2.csv")
    TRAIN['job_type'].unique()

output:

    array(['Self employed', 'Government Dependent',
           'Formally employed Private', 'Informally employed',
           'Formally employed Government', 'Farming and Fishing',
           'Remittance Dependent', 'Other Income',
           'Dont Know/Refuse to answer', 'No Income'], dtype=object)
Answer:
First create a dictionary that maps each bin to its categories, invert it by swapping keys and values, and finally use Series.map:
    import pandas as pd

    a = ['Self employed', 'Government Dependent', 'Formally employed Private',
         'Informally employed', 'Formally employed Government',
         'Farming and Fishing', 'Remittance Dependent', 'Other Income',
         'Dont Know/Refuse to answer', 'No Income']
    TRAIN = pd.DataFrame({'job_type': a})
    # add the other groups to the dictionary
    # keys must match the spelling in the data exactly ('Dont', no apostrophe)
    d = {0: ['Government Dependent', 'Formally employed Government', 'Formally employed Private'],
         1: ['Remittance Dependent', 'Informally employed'],
         2: ['Dont Know/Refuse to answer', 'No Income']}

    # swap keys and values in the dictionary
    # http://stackoverflow.com/a/31674731/2901002
    d1 = {k: oldk for oldk, oldv in d.items() for k in oldv}

    TRAIN['new'] = TRAIN['job_type'].map(d1)
    print(TRAIN)

                           job_type  new
    0                 Self employed  NaN
    1          Government Dependent  0.0
    2     Formally employed Private  0.0
    3           Informally employed  1.0
    4  Formally employed Government  0.0
    5           Farming and Fishing  NaN
    6          Remittance Dependent  1.0
    7                  Other Income  NaN
    8    Dont Know/Refuse to answer  2.0
    9                     No Income  2.0
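The dictionary above covers only some of the groups, so the remaining categories come out as NaN. A minimal sketch of the full mapping requested in the question follows. It assumes the category spellings match the `unique()` output exactly (for example `'Dont Know/Refuse to answer'` without an apostrophe). The names `df`, `bins`, `lookup` and `job_bin` are illustrative and not part of the original code.

```python
import pandas as pd

job_types = [
    'Self employed', 'Government Dependent', 'Formally employed Private',
    'Informally employed', 'Formally employed Government',
    'Farming and Fishing', 'Remittance Dependent', 'Other Income',
    'Dont Know/Refuse to answer', 'No Income',
]
df = pd.DataFrame({'job_type': job_types})

# Bin label -> categories, as listed in the question.
bins = {
    0: ['Government Dependent', 'Formally employed Government',
        'Formally employed Private'],
    1: ['Remittance Dependent', 'Informally employed', 'Self employed',
        'Other Income'],
    2: ['Dont Know/Refuse to answer', 'No Income', 'Farming and Fishing'],
}

# Invert to category -> bin label so Series.map can look each value up.
lookup = {cat: label for label, cats in bins.items() for cat in cats}

mapped = df['job_type'].map(lookup)

# Categories missing from the lookup become NaN; report them before casting.
unmapped = df.loc[mapped.isna(), 'job_type'].unique()
if len(unmapped):
    raise ValueError(f'unmapped categories: {list(unmapped)}')

# Safe to cast to integers once every category is known to be mapped.
df['job_bin'] = mapped.astype(int)
```

Checking for unmapped values before the cast is a design choice: a misspelled category then fails loudly instead of silently turning into NaN.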
If the output has only a few distinct values, such as 0, 1 and NaN, numpy.select also works, but with many groups it becomes complicated and slow:
    import numpy as np

    # one boolean mask per group; spellings must match the data exactly
    m1 = TRAIN['job_type'].isin(['Government Dependent', 'Formally employed Government', 'Formally employed Private'])
    m2 = TRAIN['job_type'].isin(['Remittance Dependent', 'Informally employed'])
    m3 = TRAIN['job_type'].isin(['Dont Know/Refuse to answer', 'No Income'])

    # rows that match none of the masks get the default, NaN
    TRAIN['new'] = np.select([m1, m2, m3], [0, 1, 2], np.nan)
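If numpy.select is still preferred with more groups, the masks can be generated from a dictionary rather than written out one by one. This is only a sketch: the names `s`, `groups`, `conditions` and `choices` are illustrative, and `'Unknown label'` is a made-up value included to show the default.

```python
import numpy as np
import pandas as pd

# Small illustrative series; 'Unknown label' is a hypothetical value
# that belongs to no group.
s = pd.Series(['Self employed', 'Government Dependent', 'No Income',
               'Unknown label'], name='job_type')

# Bin label -> categories, following the bins in the question.
groups = {
    0: ['Government Dependent', 'Formally employed Government',
        'Formally employed Private'],
    1: ['Remittance Dependent', 'Informally employed', 'Self employed',
        'Other Income'],
    2: ['Dont Know/Refuse to answer', 'No Income', 'Farming and Fishing'],
}

# One boolean mask per group, in the same order as the labels.
conditions = [s.isin(cats) for cats in groups.values()]
choices = list(groups.keys())

# Values matching no mask fall back to the default, NaN.
result = np.select(conditions, choices, default=np.nan)
```

Each group still costs one `isin` pass over the column, so the dictionary lookup with Series.map remains the simpler option when there are many groups.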