我有一个数据框 visit
=
visit_occurrence_id visit_start_date person_id 1 2016-06-01 1 2 2019-05-01 2 3 2016-01-22 1 4 2017-02-14 2 5 2018-05-11 3
还有另一个数据框 measurement
=
measurement_date person_id visit_occurrence_id2017-09-04 1 Nan2018-04-24 2 Nan2018-05-22 2 Nan2019-02-02 1 Nan2019-01-28 3 Nan2019-05-07 1 Nan2018-12-11 3 Nan2017-04-28 3 Nan
我想根据 person_id 和尽可能接近的日期,用 visit 数据框中的 visit_occurrence_id 来填充 measurement 数据框的 visit_occurrence_id。
我写了一段代码,但它运行时间很长。
measurement 有 7*10^5 行。
注意:visit_start_date 和 measurement_date 是对象类型
我的代码 - import datetime as dtunique_person_list = measurement['person_id'].unique().tolist()def nearest_date(row,date_list): date_list = [dt.datetime.strptime(date, '%Y-%m-%d').date() for date in date_list] row = min(date_list, key=lambda x: abs(x - row)) return rowmodified_measurement = pd.DataFrame(columns = measurement.columns)for person in unique_person_list: near_visit_dates = visit[visit['person_id']==person]['visit_start_date'].tolist() if near_visit_dates: near_visit_dates = list(filter(None, near_visit_dates)) near_visit_dates = [i.strftime('%Y-%m-%d') for i in near_visit_dates] store_dates = measurement.loc[measurement['person_id']== person]['measurement_date'] store_dates= store_dates.apply(nearest_date, args=(near_visit_dates,)) modified_measurement = modified_measurement.append(store_dates)
我的代码执行时间相当长。你能帮我降低时间复杂度或提供另一种解决方案吗?
编辑 – 添加数据框构造函数。
import numpy as npmeasurement = {'measurement_date':['2017-09-04', '2018-04-24', '2018-05-22', '2019-02-02', '2019-01-28', '2019-05-07', '2018-12-11','2017-04-28'], 'person_id':[1, 2, 2, 1, 3, 1, 3, 3],'visit_occurrence_id':[np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan]}visit = {'visit_occurrence_id':[1, 2, 3, 4, 5], 'visit_start_date':['2016-06-01', '2019-05-01', '2016-01-22', '2017-02-14', '2018-05-11'], 'person_id':[1, 2, 1, 2, 3]}# 创建数据框measurement = pd.DataFrame(measurement)visit = pd.DataFrame(visit)
回答:
你可以这样做:
df=pd.merge(measurement[["person_id", "measurement_date"]], visit, on="person_id", how="inner")df["dt_diff"]=df[["visit_start_date", "measurement_date"]].apply(lambda x: abs(datetime.datetime.strptime(x["visit_start_date"], '%Y-%m-%d').date() - datetime.datetime.strptime(x["measurement_date"], '%Y-%m-%d').date()), axis=1)df=pd.merge(df, df.groupby(["person_id", "measurement_date"])["dt_diff"].min(), on=["person_id", "dt_diff", "measurement_date"], how="inner")res=pd.merge(measurement, df, on=["measurement_date", "person_id"], suffixes=["", "_2"])[["measurement_date", "person_id", "visit_occurrence_id_2"]]
输出:
measurement_date person_id visit_occurrence_id_20 2017-09-04 1 11 2018-04-24 2 22 2018-05-22 2 23 2019-02-02 1 14 2019-01-28 3 55 2019-05-07 1 16 2018-12-11 3 57 2017-04-28 3 5