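I am trying to compute WoE (Weight of Evidence) by hand, but I cannot reproduce the values produced by category_encoders' WOEEncoder. This is the data frame I want to compute scores for: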
import pandas as pd

df = pd.DataFrame({'cat': ['a', 'b', 'a', 'b', 'a', 'a', 'b', 'c', 'c'], 'target': [1, 0, 0, 1, 0, 0, 1, 1, 0]})
This is the code I use to compute the WoE scores:
from category_encoders import WOEEncoder

woe = WOEEncoder(cols=['cat'], random_state=42)
X = df['cat']
y = df.target
encoded_df = woe.fit_transform(X, y)
The result is:
0 -0.538997
1 0.559616
2 -0.538997
3 0.559616
4 -0.538997
5 -0.538997
6 0.559616
7 0.154151
8 0.154151
So 'a' is encoded as -0.538997, 'b' as 0.559616, and 'c' as 0.154151.
When I compute these scores by hand I get different results. The formula I use is
ln(% of non events / % of events).
For example, to compute the WoE of 'a':
% of non events = targets which are 0 for 'a' / total targets for group 'a'
So % of non events = 3/4 = 0.75
% of events = targets which are 1 for 'a' / total targets for group 'a'
So % of events = 1/4 = 0.25
Now, 0.75/0.25 = 3
Therefore WoE(a) = ln(3) ≈ 1.0986, which differs from the encoder's result above.
Answer:
Since this is an open-source project, you can read the source code of the encoder:
http://contrib.scikit-learn.org/category_encoders/_modules/category_encoders/woe.html#WOEEncoder
Two main issues keep your calculation from matching WOEEncoder:
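- WOEEncoder has a 'regularization' parameter whose default value is 1. Create the WOEEncoder object with regularization=0 to get the same results as an unregularized manual calculation.
- The second issue is that your interpretation of the WoE formula is wrong. The denominators are the totals over the whole data set, not over the group, and the ratio is events over non events. The formula implemented in WOEEncoder, for the case of 'a', is:
  % of non events = targets which are 0 for 'a' / total targets which are 0
  % of events = targets which are 1 for 'a' / total targets which are 1
  woe = ln(% of events / % of non events)
  For 'a' this gives:
  % of non events = 3/5
  % of events = 1/4
  ln(% of events / % of non events) = ln(5/12) = -0.8754687373538999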
If you run the modified code:
woe = WOEEncoder(cols=['cat'], random_state=42, regularization=0)
X = df['cat']
y = df.target
encoded_df = woe.fit_transform(X, y)
you will see results that match the manual calculation:
0 -0.875469
1 0.916291
2 -0.875469
3 0.916291
4 -0.875469
5 -0.875469
6 0.916291
7 0.223144
8 0.223144
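To check both sets of numbers by hand, here is a minimal pure-Python sketch of the calculation. The function name `manual_woe` is made up for illustration and is not part of category_encoders. It also assumes that regularization works as additive smoothing: the value is added to each per-category count and twice the value to each overall total. That assumption reproduces the encoder outputs quoted in this thread, but check the linked source for the exact implementation.

```python
import math
from collections import Counter

def manual_woe(categories, targets, regularization=0.0):
    """Return {category: WoE}, with WoE = ln(% of events / % of non events).

    Percentages are taken relative to the totals over the whole data set.
    The additive-smoothing form of `regularization` is an assumption.
    """
    total_events = sum(targets)
    total_non_events = len(targets) - total_events
    counts = Counter(categories)
    events = Counter(c for c, t in zip(categories, targets) if t == 1)

    result = {}
    for cat, n in counts.items():
        ev = events[cat]      # targets which are 1 for this category
        non_ev = n - ev       # targets which are 0 for this category
        pct_events = (ev + regularization) / (total_events + 2 * regularization)
        pct_non_events = (non_ev + regularization) / (total_non_events + 2 * regularization)
        result[cat] = math.log(pct_events / pct_non_events)
    return result

cats = ['a', 'b', 'a', 'b', 'a', 'a', 'b', 'c', 'c']
target = [1, 0, 0, 1, 0, 0, 1, 1, 0]

woe_unregularized = manual_woe(cats, target, regularization=0)
woe_default = manual_woe(cats, target, regularization=1)
print(woe_unregularized)  # a ≈ -0.875469, b ≈ 0.916291, c ≈ 0.223144
print(woe_default)        # a ≈ -0.538997, b ≈ 0.559616, c ≈ 0.154151
```

With regularization=0 a category that has no events or no non events would give a zero numerator or denominator, which is presumably why the encoder smooths by default.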