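Assigning custom categories to JSON data – pandas

Rather than getting new indicator columns from get_dummies, I want to assign labels to the original data. The result I am after looks like this: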

JSON input:

[{"id":100,"vehicle_type":"Car","time":"2017-04-06 01:39:43","zone":"A","type":"Checked"},{"id":101,"vehicle_type":"Truck","time":"2017-04-06 02:35:45","zone":"B","type":"Unchecked"},{"id":102,"vehicle_type":"Truck","time":"2017-04-05 03:20:12","zone":"A","type":"Checked"},{"id":103,"vehicle_type":"Car","time":"2017-04-04 10:05:04","zone":"C","type":"Unchecked"}]

Result:

id   vehicle_type  time_range  zone  type
100  0             1           1     1
101  1             1           2     0
102  1             2           1     1
103  0             3           3     0

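TS stands for timestamp. Columns: vehicle_type and type are binary; time_range is 1 -> (TS1-TS2), 2 -> (TS3-TS4), 3 -> (TS5-TS6); zone is categorical (1, 2, or 3). I want these labels to be assigned automatically when the flattened JSON data is loaded into a pandas DataFrame. Is this possible? (I do not want the zone_1, type_1, vehicle_type_3 indicator columns that pandas' get_dummies generates.) If pandas cannot do this, please recommend a Python library that can do it automatically.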


Answer:

This is the solution I could come up with. I don't know what time ranges you want.

import io

import pandas as pd

df_string = """[
{"id":100,"vehicle_type":"Car","time":"2017-04-06 01:39:43","zone":"A","type":"Checked"},
{"id":101,"vehicle_type":"Truck","time":"2017-04-06 02:35:45","zone":"B","type":"Unchecked"},
{"id":102,"vehicle_type":"Truck","time":"2017-04-05 03:20:12","zone":"A","type":"Checked"},
{"id":103,"vehicle_type":"Car","time":"2017-04-04 10:05:04","zone":"C","type":"Unchecked"}
]"""

df = pd.read_json(io.StringIO(df_string))

# Convert the text columns to categoricals.
df['zone'] = pd.Categorical(df.zone)
df['vehicle_type'] = pd.Categorical(df.vehicle_type)
df['type'] = pd.Categorical(df.type)

# cat.codes numbers the categories from 0 in sorted (alphabetical) order.
df['zone_int'] = df.zone.cat.codes
df['vehicle_type_int'] = df.vehicle_type.cat.codes
df['type_int'] = df.type.cat.codes

df.head()
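Note that `cat.codes` starts at 0 and follows alphabetical order, so `zone` comes out as 0/1/2 and `Checked` as 0, which differs from the table in the question. The following is a minimal sketch, not the answerer's code, of how the requested table could be reproduced by spelling out the category order. The `orders` mapping and the `result` frame are names made up for this illustration. The `time_range` rule is an assumption: the question never defines TS1 to TS6, so the sketch simply numbers calendar days from most recent to oldest, which happens to match the sample output.

```python
import io

import pandas as pd

raw = """[
{"id":100,"vehicle_type":"Car","time":"2017-04-06 01:39:43","zone":"A","type":"Checked"},
{"id":101,"vehicle_type":"Truck","time":"2017-04-06 02:35:45","zone":"B","type":"Unchecked"},
{"id":102,"vehicle_type":"Truck","time":"2017-04-05 03:20:12","zone":"A","type":"Checked"},
{"id":103,"vehicle_type":"Car","time":"2017-04-04 10:05:04","zone":"C","type":"Unchecked"}
]"""
df = pd.read_json(io.StringIO(raw))

# Hypothetical label orders, chosen to match the table in the question.
orders = {
    "vehicle_type": ["Car", "Truck"],   # Car -> 0, Truck -> 1
    "type": ["Unchecked", "Checked"],   # Unchecked -> 0, Checked -> 1
    "zone": ["A", "B", "C"],            # A -> 1, B -> 2, C -> 3 after the +1 below
}

def codes(column):
    """Return integer codes for a column using the explicit order above."""
    return pd.Categorical(df[column], categories=orders[column]).codes

# Assumption: one time range per calendar day, most recent day first.
days = pd.to_datetime(df["time"]).dt.normalize()
day_codes, unique_days = pd.factorize(days, sort=True)  # oldest day -> 0

result = pd.DataFrame({
    "id": df["id"],
    "vehicle_type": codes("vehicle_type"),
    "time_range": len(unique_days) - day_codes,
    "zone": codes("zone") + 1,
    "type": codes("type"),
})
print(result)
```

A value that is missing from the explicit category list gets the code -1, which makes unexpected input easy to spot.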

Edit: this is the solution I could come up with.

import datetime
import io
import math

import pandas as pd


# Taken from http://stackoverflow.com/questions/13071384/python-ceil-a-datetime-to-next-quarter-of-an-hour
def ceil_dt(dt, num_seconds=900):
    # Seconds elapsed since the start of the current hour.
    nsecs = dt.minute*60 + dt.second + dt.microsecond*1e-6
    # Seconds to add to reach the next multiple of num_seconds.
    delta = math.ceil(nsecs / num_seconds) * num_seconds - nsecs
    return dt + datetime.timedelta(seconds=delta)


df_string = """[
{"id":100,"vehicle_type":"Car","time":"2017-04-06 01:39:43","zone":"A","type":"Checked"},
{"id":101,"vehicle_type":"Truck","time":"2017-04-06 02:35:45","zone":"B","type":"Unchecked"},
{"id":102,"vehicle_type":"Truck","time":"2017-04-05 03:20:12","zone":"A","type":"Checked"},
{"id":103,"vehicle_type":"Car","time":"2017-04-04 10:05:04","zone":"C","type":"Unchecked"}
]"""

df = pd.read_json(io.StringIO(df_string))

# Categorical columns and their integer codes.
df['zone'] = pd.Categorical(df.zone)
df['vehicle_type'] = pd.Categorical(df.vehicle_type)
df['type'] = pd.Categorical(df.type)
df['zone_int'] = df.zone.cat.codes
df['vehicle_type_int'] = df.vehicle_type.cat.codes
df['type_int'] = df.type.cat.codes

# Calendar features.
df['time'] = pd.to_datetime(df.time)
df['dayofweek'] = df.time.dt.dayofweek
df['month_int'] = df.time.dt.month
df['year_int'] = df.time.dt.year
df['day'] = df.time.dt.day
df['date'] = df.time.apply(lambda x: x.date())
df['month'] = df.date.apply(lambda x: datetime.date(x.year, x.month, 1))
df['year'] = df.date.apply(lambda x: datetime.date(x.year, 1, 1))

# Time-of-day features.
df['hour'] = df.time.dt.hour
df['mins'] = df.time.dt.minute
df['seconds'] = df.time.dt.second

# 1-based index of the 3-, 6- and 12-hour block the hour falls into.
df['time_interval_3hour'] = df.hour.apply(lambda x: math.floor(x/3)+1)
df['time_interval_6hour'] = df.hour.apply(lambda x: math.floor(x/6)+1)
df['time_interval_12hour'] = df.hour.apply(lambda x: math.floor(x/12)+1)

# dayofweek is 0 for Monday, so 5 and 6 are Saturday and Sunday.
df['weekend'] = df.dayofweek.apply(lambda x: x > 4)

# Round each timestamp up to the next quarter hour and half hour.
df['ceil_quarter_an_hour'] = df.time.apply(lambda x: ceil_dt(x))
df['ceil_half_an_hour'] = df.time.apply(lambda x: ceil_dt(x, num_seconds=1800))

df.head()
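Several of the row-by-row `apply` calls above have vectorized counterparts on the `.dt` accessor. The sketch below is an alternative, not part of the original answer, and recomputes a few of the same features on the sample timestamps. The `times` and `features` names are made up for this illustration.

```python
import pandas as pd

# The four sample timestamps from the question.
times = pd.Series(pd.to_datetime([
    "2017-04-06 01:39:43",
    "2017-04-06 02:35:45",
    "2017-04-05 03:20:12",
    "2017-04-04 10:05:04",
]))

features = pd.DataFrame({"time": times})

# 1-based index of the 3-hour block, using integer division instead of math.floor.
features["time_interval_3hour"] = times.dt.hour // 3 + 1

# dayofweek is 0 for Monday, so values above 4 are Saturday and Sunday.
features["weekend"] = times.dt.dayofweek > 4

# Round up to the next quarter hour and half hour.
features["ceil_quarter_an_hour"] = times.dt.ceil("15min")
features["ceil_half_an_hour"] = times.dt.ceil("30min")

print(features)
```

Timestamps that already sit exactly on a boundary are left unchanged by `dt.ceil`, as they are by `ceil_dt`.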
