Suppose we have the following DataFrame:
d = {
    'month': ['01/01/2020', '01/02/2020', '01/03/2020', '01/01/2020', '01/02/2020', '01/03/2020'],
    'country': ['Japan', 'Japan', 'Japan', 'Poland', 'Poland', 'Poland'],
    'level': ['A01', 'A01', 'A01', 'A00', 'A00', 'A00'],
    'job title': ['Insights Manager', 'Insights Manager', 'Insights Manager', 'Sales Director', 'Sales Director', 'Sales Director'],
    'number': [0, 0.001, 0, 0, 0, 0],
    'age': [24, 22, 45, 38, 60, 32],
}
df = pd.DataFrame(d)
Computing the variance of all columns gives the following:
import pandas as pd
df.agg("var")
Result:
number    1.666667e-07
age       2.025667e+02
dtype: float64
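The idea is to drop every column whose variance falls within a given range. For example, if a column's variance is between 0 and 0.0001, drop that column (here the number column would be dropped, because its variance is in that range).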
Trying it like this:
df= df.loc[:, 0 < df.std() < .0001]
raises the following error:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Is it possible to drop the columns of a pandas DataFrame whose variance lies within a tolerance range?
Answer:
Another solution (using .between + .drop(columns=...)):
var = df.agg("var", numeric_only=True)
df = df.drop(columns=var[var.between(0, 0.0001)].index)
print(df)
Output:
        month  country  level         job title  age
0  01/01/2020    Japan    A01  Insights Manager   24
1  01/02/2020    Japan    A01  Insights Manager   22
2  01/03/2020    Japan    A01  Insights Manager   45
3  01/01/2020   Poland    A00    Sales Director   38
4  01/02/2020   Poland    A00    Sales Director   60
5  01/03/2020   Poland    A00    Sales Director   32
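For comparison, here is a minimal sketch of the same idea written with an explicit boolean mask, which also shows why the chained comparison in the question fails. It uses a reduced copy of the question's data, and the names low_variance and filtered are illustrative only. It differs from .between in one respect: .between includes both endpoints by default, so a constant column (variance exactly 0) would be dropped too, whereas the strict comparisons below keep such a column.

```python
import pandas as pd

# Reduced copy of the question's data (assumption: three columns are enough
# to illustrate the technique).
df = pd.DataFrame({
    "country": ["Japan", "Japan", "Japan", "Poland", "Poland", "Poland"],
    "number": [0, 0.001, 0, 0, 0, 0],
    "age": [24, 22, 45, 38, 60, 32],
})

# Variance of the numeric columns only; non-numeric columns are skipped.
var = df.var(numeric_only=True)

# `0 < var < 0.0001` raises ValueError because Python expands it to
# `(0 < var) and (var < 0.0001)`, and `and` needs a single True/False.
# Combining the two element-wise comparisons with `&` avoids that.
low_variance = (var > 0) & (var < 0.0001)

# Drop the columns flagged by the mask; all other columns are kept.
filtered = df.drop(columns=var[low_variance].index)
print(filtered.columns.tolist())
```

If inclusive bounds are what you want, var.between(0, 0.0001) from the answer above is the shorter spelling.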