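I am generating features for a machine learning algorithm and would like to compute some statistics from a DataFrame, similar to what the describe() function does.
Here is the sample code: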
    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'A': [1, np.nan, 3], 'B': [20, 30, 40]})
    print(df)

    df_t = df.describe()
    print(type(df_t))
    print(df_t)
    print(df_t.columns)
    print(df_t.index)
Output:
         A   B
    0  1.0  20
    1  NaN  30
    2  3.0  40
    <class 'pandas.core.frame.DataFrame'>
                  A     B
    count  2.000000   3.0
    mean   2.000000  30.0
    std    1.414214  10.0
    min    1.000000  20.0
    25%    1.500000  25.0
    50%    2.000000  30.0
    75%    2.500000  35.0
    max    3.000000  40.0
    Index(['A', 'B'], dtype='object')
    Index(['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max'], dtype='object')
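So I have two questions:
- How can I reshape the result of describe() into a single row, with column names such as A_count, A_mean, ..., B_75%, B_max?
- If I want to use custom functions instead of describe(), for example adding np.median and np.percentile to compute the 20th and 80th percentiles, what is the best way to do it?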
Answer:
A solution to the first question (I am not sure whether someone has already come up with it):
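    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'A': [1, np.nan, 3], 'B': [20, 30, 40]})
    print(df)

    df_t = df.describe()
    print(type(df_t))
    print(df_t)
    print(df_t.columns)
    print(df_t.index)

    # Build the flat names statistic by statistic, column by column, so that
    # they follow the same row-major order as df_t.values.reshape(1, N) below.
    col_names = []
    for stat_name in df_t.index:
        for col_name in df_t.columns:
            col_names.append(str(col_name) + '_' + str(stat_name))
    print('col_names', col_names)

    N = len(col_names)
    print('len(col_names)', N)

    # Flatten the 8x2 table of statistics into a single row.
    row = df_t.values.reshape(1, N)
    print('row.shape', row.shape)

    df_stat = pd.DataFrame(data=row, columns=col_names)
    print(df_stat)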
Output:
         A   B
    0  1.0  20
    1  NaN  30
    2  3.0  40
    <class 'pandas.core.frame.DataFrame'>
                  A     B
    count  2.000000   3.0
    mean   2.000000  30.0
    std    1.414214  10.0
    min    1.000000  20.0
    25%    1.500000  25.0
    50%    2.000000  30.0
    75%    2.500000  35.0
    max    3.000000  40.0
    Index(['A', 'B'], dtype='object')
    Index(['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max'], dtype='object')
    col_names ['A_count', 'B_count', 'A_mean', 'B_mean', 'A_std', 'B_std', 'A_min', 'B_min', 'A_25%', 'B_25%', 'A_50%', 'B_50%', 'A_75%', 'B_75%', 'A_max', 'B_max']
    len(col_names) 16
    row.shape (1, 16)
       A_count  B_count  A_mean  B_mean     A_std  B_std  A_min  B_min  A_25%  \
    0      2.0      3.0     2.0    30.0  1.414214   10.0    1.0   20.0    1.5

       B_25%  A_50%  B_50%  A_75%  B_75%  A_max  B_max
    0   25.0    2.0   30.0    2.5   35.0    3.0   40.0
Another solution to the first question, based on Andy Hayden's answer:
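    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'A': [1, np.nan, 3], 'B': [20, 30, 40]})
    print(df)

    df_t = df.describe()
    print(type(df_t))
    print(df_t)
    print(df_t.columns)
    print(df_t.index)

    # stack() turns the table into a Series indexed by (statistic, column).
    df_s = df_t.stack()
    print(type(df_s))
    print(df_s)
    print(df_s.shape)

    # Reverse each (statistic, column) pair and join it into 'column_statistic'.
    df_s.index = df_s.index.map(lambda x: '_'.join(x[::-1]))
    print(type(df_s))
    print(df_s)

    # Convert the Series into a one-row DataFrame.
    df_s = df_s.to_frame().T
    print(type(df_s))
    print(df_s)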
Output:
         A   B
    0  1.0  20
    1  NaN  30
    2  3.0  40
    <class 'pandas.core.frame.DataFrame'>
                  A     B
    count  2.000000   3.0
    mean   2.000000  30.0
    std    1.414214  10.0
    min    1.000000  20.0
    25%    1.500000  25.0
    50%    2.000000  30.0
    75%    2.500000  35.0
    max    3.000000  40.0
    Index(['A', 'B'], dtype='object')
    Index(['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max'], dtype='object')
    <class 'pandas.core.series.Series'>
    count  A     2.000000
           B     3.000000
    mean   A     2.000000
           B    30.000000
    std    A     1.414214
           B    10.000000
    min    A     1.000000
           B    20.000000
    25%    A     1.500000
           B    25.000000
    50%    A     2.000000
           B    30.000000
    75%    A     2.500000
           B    35.000000
    max    A     3.000000
           B    40.000000
    dtype: float64
    (16,)
    <class 'pandas.core.series.Series'>
    A_count     2.000000
    B_count     3.000000
    A_mean      2.000000
    B_mean     30.000000
    A_std       1.414214
    B_std      10.000000
    A_min       1.000000
    B_min      20.000000
    A_25%       1.500000
    B_25%      25.000000
    A_50%       2.000000
    B_50%      30.000000
    A_75%       2.500000
    B_75%      35.000000
    A_max       3.000000
    B_max      40.000000
    dtype: float64
    <class 'pandas.core.frame.DataFrame'>
       A_count  B_count  A_mean  B_mean     A_std  B_std  A_min  B_min  A_25%  \
    0      2.0      3.0     2.0    30.0  1.414214   10.0    1.0   20.0    1.5

       B_25%  A_50%  B_50%  A_75%  B_75%  A_max  B_max
    0   25.0    2.0   30.0    2.5   35.0    3.0   40.0
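For reuse, the stack-based approach can be wrapped in a small helper without the debugging prints. This is only a sketch: the name describe_flat is made up for illustration and is not part of pandas. It builds the labels with an f-string instead of '_'.join, so it should also cope with column labels that are not strings.

```python
import numpy as np
import pandas as pd


def describe_flat(df):
    """Hypothetical helper: describe() flattened into a single row."""
    # stack() gives a Series indexed by (statistic, column) pairs.
    s = df.describe().stack()
    # Rename each pair to 'column_statistic', e.g. ('count', 'A') -> 'A_count'.
    s.index = [f"{col}_{stat}" for stat, col in s.index]
    # Turn the Series into a one-row DataFrame.
    return s.to_frame().T


df = pd.DataFrame({'A': [1, np.nan, 3], 'B': [20, 30, 40]})
flat = describe_flat(df)
print(flat)
```

The returned frame has one row and one column per (column, statistic) pair, in the same order as the output shown above.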
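As for the second question, I managed to do it as follows (the code is not very pretty, though). Note that the 'min', 'max' and 'sum' functions are only examples; the original idea was to extend the functionality of describe: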
    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'A': [1, np.nan, 3], 'B': [20, 30, 40]})
    print(df)

    def func(df, func_name):
        # Compute one statistic per column, selected by name.
        if func_name == 'max':
            df_t = df.max(axis=0)
        elif func_name == 'min':
            df_t = df.min(axis=0)
        elif func_name == 'sum':
            df_t = df.sum(axis=0)
        else:
            raise NotImplementedError
        # Turn the resulting Series into a one-row DataFrame.
        df_t = df_t.to_frame().T
        print(type(df_t))
        print(df_t)
        # Append the statistic name to every column label, e.g. 'A' -> 'A_min'.
        df_t.rename(columns=lambda x: x + '_' + func_name, inplace=True)
        print(type(df_t))
        print(df_t)
        return df_t

    func_names = ['min', 'max', 'sum']
    df_list = []
    for func_name in func_names:
        df_t = func(df, func_name)
        df_list.append(df_t)

    # Place the one-row frames side by side.
    df_stat = pd.concat(df_list, axis=1)
    print(df_stat)
Output:
         A   B
    0  1.0  20
    1  NaN  30
    2  3.0  40
    <class 'pandas.core.frame.DataFrame'>
         A     B
    0  1.0  20.0
    <class 'pandas.core.frame.DataFrame'>
       A_min  B_min
    0    1.0   20.0
    <class 'pandas.core.frame.DataFrame'>
         A     B
    0  3.0  40.0
    <class 'pandas.core.frame.DataFrame'>
       A_max  B_max
    0    3.0   40.0
    <class 'pandas.core.frame.DataFrame'>
         A     B
    0  4.0  90.0
    <class 'pandas.core.frame.DataFrame'>
       A_sum  B_sum
    0    4.0   90.0
       A_min  B_min  A_max  B_max  A_sum  B_sum
    0    1.0   20.0    3.0   40.0    4.0   90.0
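The same pattern might be extended to the functions the question actually asks about, np.median and np.percentile at 20% and 80%. The sketch below is one possible way, not necessarily the best one. The names flat_stats, funcs and df_custom are made up for illustration, and dropping NaN values before each call is my assumption, chosen to mirror how describe() ignores missing values.

```python
import numpy as np
import pandas as pd


def flat_stats(df, funcs):
    """Hypothetical helper: one-row DataFrame with '<column>_<stat>' names."""
    values = {}
    for stat_name, fn in funcs.items():
        for col in df.columns:
            # Assumption: drop NaN first, as describe() ignores missing values.
            data = df[col].dropna().to_numpy()
            values[f"{col}_{stat_name}"] = fn(data)
    # A list holding one dict becomes a DataFrame with a single row.
    return pd.DataFrame([values])


df = pd.DataFrame({'A': [1, np.nan, 3], 'B': [20, 30, 40]})

# Map each statistic name to a function that takes a 1-D array.
funcs = {
    'median': np.median,
    '20%': lambda x: np.percentile(x, 20),
    '80%': lambda x: np.percentile(x, 80),
}

df_custom = flat_stats(df, funcs)
print(df_custom)
```

Using a dict of name to function avoids the if/elif chain, so adding another statistic is a single new entry. The resulting columns are A_median, B_median, A_20%, B_20%, A_80% and B_80%.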