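The CSV file on the left has five columns. The application column contains several application types, separated by ;. Based on the app, device, and district types, I want to predict target. But first, I want to convert the file into the data frame on the right so that I can apply machine learning.
How can I do this with Python?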
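Answer:
You need to apply multi-hot encoding to the application column, because each row can hold several values, and one-hot encoding to the other categorical columns.
Here is my solution: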
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({'number': np.random.randint(0,10,size=5),
...                    'device': np.random.choice(['a','b'],size=5),
...                    'application': ['app2;app3','app1','app2;app4', 'app1;app2', 'app1'],
...                    'district': np.random.choice(['aa', 'bb', 'cc'],size=5)})
>>> df
  application device district  number
0   app2;app3      b       aa       3
1        app1      a       cc       7
2   app2;app4      a       aa       3
3   app1;app2      b       bb       9
4        app1      a       cc       4

from sklearn.preprocessing import OneHotEncoder, MultiLabelBinarizer

mlb = MultiLabelBinarizer()
# Assuming application names are separated by ;
mhv = mlb.fit_transform(df['application'].apply(lambda x: set(x.split(';'))))
df_out = pd.DataFrame(mhv, columns=mlb.classes_)

# scikit-learn >= 1.2 uses sparse_output; older releases used sparse=False
enc = OneHotEncoder(sparse_output=False)
ohe_vars = ['device', 'district']  # specify the list of columns here
ohv = enc.fit_transform(df.loc[:, ohe_vars])
ohe_col_names = ['%s_%s' % (var, cat) for var, cats in zip(ohe_vars, enc.categories_) for cat in cats]
# assign returns a new DataFrame, so keep the result
df_out = df_out.assign(**dict(zip(ohe_col_names, ohv.T)))
df_out
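If you would rather not depend on scikit-learn for this step, the same multi-hot plus one-hot idea can probably be done with pandas alone. The following is only a minimal sketch of that alternative, not the approach from the answer above: the sample rows and the target values are made up for illustration, and it assumes the application names are separated by ; with no surrounding spaces.

```python
import pandas as pd

# Hypothetical sample data mirroring the columns described in the question.
df = pd.DataFrame({
    "device": ["b", "a", "a", "b", "a"],
    "application": ["app2;app3", "app1", "app2;app4", "app1;app2", "app1"],
    "district": ["aa", "cc", "aa", "bb", "cc"],
    "target": [1, 0, 1, 0, 1],  # made-up labels, for illustration only
})

# Multi-hot encode: one 0/1 column per distinct application name.
apps = df["application"].str.get_dummies(sep=";")

# One-hot encode the single-valued categorical columns.
cats = pd.get_dummies(df[["device", "district"]], dtype=int)

# Combine the encoded features with the target, aligned on the row index.
encoded = pd.concat([apps, cats, df["target"]], axis=1)
print(encoded)
```

One trade-off to keep in mind: the scikit-learn encoders remember the categories they were fitted on, so they can be reused on new data with the same column layout, whereas get_dummies only sees the values present in the frame it is given.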