I'm trying to count how many times each ID occurs within each sliding window of the following data:
                      ID
DATE
2017-05-17 15:49:51  s_2
2017-05-17 15:49:52  s_5
2017-05-17 15:49:55  s_2
2017-05-17 15:49:56  s_3
2017-05-17 15:49:58  s_5
2017-05-17 15:49:59  s_5
I want the count of each ID within overlapping rolling windows of size 3. The result should look like this:
DATE                 ID   s_2_count  s_3_count  s_5_count
2017-05-17 15:49:51  s_2          2          0          1
2017-05-17 15:49:52  s_5          1          1          1
2017-05-17 15:49:55  s_2          1          1          1
2017-05-17 15:49:56  s_3          0          1          2
2017-05-17 15:49:58  s_5        NaN        NaN        NaN
2017-05-17 15:49:59  s_5        NaN        NaN        NaN
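Answer:
Use str.get_dummies, rolling, sum, shift, and add_suffix. rolling(3).sum() looks backwards (the current row plus the two rows before it), so shift(-2) moves each result up two rows to make the window start at the current row. The last two rows have no complete window and come out as NaN.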
df.ID.str.get_dummies().rolling(3).sum().shift(-2).add_suffix('_count')
Output:
                     s_2_count  s_3_count  s_5_count
DATE
2017-05-17 15:49:51        2.0        0.0        1.0
2017-05-17 15:49:52        1.0        1.0        1.0
2017-05-17 15:49:55        1.0        1.0        1.0
2017-05-17 15:49:56        0.0        1.0        2.0
2017-05-17 15:49:58        NaN        NaN        NaN
2017-05-17 15:49:59        NaN        NaN        NaN
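To make the chain easier to follow, here is a minimal sketch that rebuilds the sample data and runs the same steps one at a time. It assumes DATE is a DatetimeIndex and ID is the only column, which the question does not state outright; the names dummies, trailing and counts are only illustrative.

```python
import pandas as pd

# Assumption: DATE is the index and ID is the only column, as in the question's printout.
df = pd.DataFrame(
    {"ID": ["s_2", "s_5", "s_2", "s_3", "s_5", "s_5"]},
    index=pd.DatetimeIndex(
        [
            "2017-05-17 15:49:51",
            "2017-05-17 15:49:52",
            "2017-05-17 15:49:55",
            "2017-05-17 15:49:56",
            "2017-05-17 15:49:58",
            "2017-05-17 15:49:59",
        ],
        name="DATE",
    ),
)

# Step 1: one indicator column per distinct ID (1 where the row has that ID).
dummies = df.ID.str.get_dummies()

# Step 2: sum each indicator over the current row and the two rows before it.
trailing = dummies.rolling(3).sum()

# Step 3: move every result up two rows so the window starts at the current row,
# then rename the columns to s_2_count, s_3_count, s_5_count.
counts = trailing.shift(-2).add_suffix("_count")
print(counts)
```

Note that rolling(3) here counts rows, not seconds, so the gaps between timestamps do not affect the result.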
Let's assign it back to the dataframe:
df.assign(**df.ID.str.get_dummies().rolling(3).sum().shift(-2).add_suffix('_count'))
Or use join:
df.join(df.ID.str.get_dummies().rolling(3).sum().shift(-2).add_suffix('_count'))
Output:
                      ID  s_2_count  s_3_count  s_5_count
DATE
2017-05-17 15:49:51  s_2        2.0        0.0        1.0
2017-05-17 15:49:52  s_5        1.0        1.0        1.0
2017-05-17 15:49:55  s_2        1.0        1.0        1.0
2017-05-17 15:49:56  s_3        0.0        1.0        2.0
2017-05-17 15:49:58  s_5        NaN        NaN        NaN
2017-05-17 15:49:59  s_5        NaN        NaN        NaN
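The assign and join variants should give the same frame on this data. The sketch below checks that under the same assumed data layout (DATE as index, ID as the only column); the names counts, assigned and joined are illustrative.

```python
import pandas as pd

# Assumption: same sample data as in the question, DATE as the index.
df = pd.DataFrame(
    {"ID": ["s_2", "s_5", "s_2", "s_3", "s_5", "s_5"]},
    index=pd.DatetimeIndex(
        [
            "2017-05-17 15:49:51",
            "2017-05-17 15:49:52",
            "2017-05-17 15:49:55",
            "2017-05-17 15:49:56",
            "2017-05-17 15:49:58",
            "2017-05-17 15:49:59",
        ],
        name="DATE",
    ),
)
counts = df.ID.str.get_dummies().rolling(3).sum().shift(-2).add_suffix("_count")

# ** unpacks the frame column by column, so each count column is passed
# to assign as a keyword argument named after the column.
assigned = df.assign(**counts)

# join aligns the two frames on their shared DATE index.
joined = df.join(counts)

# Compare the two results.
print(assigned.equals(joined))
```

One practical difference: assign silently overwrites an existing column of the same name, while join raises an error on overlapping column names unless a suffix is given.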
Option 2: use pd.crosstab
df.assign(**pd.crosstab(df.index,df.ID).rolling(3).sum().shift(-2))
Or use join:
df.join(pd.crosstab(df.index,df.ID).rolling(3).sum().shift(-2))
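For comparison, here is a small sketch of the crosstab route on the same sample data, again assuming DATE is the index and ID the only column. The add_suffix call is my addition to match the column names asked for in the question; the option 2 one-liners above keep the bare ID names (s_2, s_3, s_5).

```python
import pandas as pd

# Assumption: same sample data as in the question, DATE as the index.
df = pd.DataFrame(
    {"ID": ["s_2", "s_5", "s_2", "s_3", "s_5", "s_5"]},
    index=pd.DatetimeIndex(
        [
            "2017-05-17 15:49:51",
            "2017-05-17 15:49:52",
            "2017-05-17 15:49:55",
            "2017-05-17 15:49:56",
            "2017-05-17 15:49:58",
            "2017-05-17 15:49:59",
        ],
        name="DATE",
    ),
)

# crosstab tabulates how often each ID occurs at each index value; with unique
# timestamps that is the same 0/1 indicator table str.get_dummies produces.
table = pd.crosstab(df.index, df.ID)

# Same forward-looking window of three rows as in the str.get_dummies version.
counts = table.rolling(3).sum().shift(-2)

# Added for illustration: suffix the columns so they match the requested names.
result = df.join(counts.add_suffix("_count"))
print(result)
```

A caveat worth keeping in mind: crosstab collapses rows that share the same index value into one row, so if timestamps can repeat, the two options will no longer line up row for row, whereas str.get_dummies keeps one row per input row.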