I have a bank marketing dataset and need to predict whether a customer will subscribe to a term deposit. It has a categorical column named "job" that holds each customer's occupation type. I am currently doing exploratory data analysis (EDA) and want to find out which job category contributes most to a positive prediction.

I plan to use logistic regression for this (not sure it is the most appropriate method; suggestions for alternatives are welcome).

Here are the steps I took:

I k-hot encoded the job categories (one 1/0 dummy column per job type) and k-1 encoded the target variable into a single Target_yes column of 1/0 values (1 means the customer subscribed to a term deposit, 0 means they did not).
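For reference, a minimal sketch of how this encoding could be produced with pandas; the raw frame name `df` and the raw column names `job` and `Target` are assumptions for illustration, not taken from my actual code:

```python
import pandas as pd

# Hypothetical raw frame: df has a categorical 'job' column and a yes/no 'Target' column.
# k-hot encoding of job: one 1/0 dummy column per category, all k levels kept.
vari = pd.get_dummies(df['job'], prefix='job')

# k-1 encoding of the target: drop the first ('no') level, keep the single
# 1/0 column 'Target_yes'.
tgt = pd.get_dummies(df['Target'], prefix='Target', drop_first=True)['Target_yes']
```

The encoded job columns look like this: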
```
       job_management  job_technician  job_entrepreneur  job_blue-collar  job_unknown  job_retired  job_admin.  job_services  job_self-employed  job_unemployed  job_housemaid  job_student
0                   1               0                 0                0            0            0           0             0                  0               0              0            0
1                   0               1                 0                0            0            0           0             0                  0               0              0            0
2                   0               0                 1                0            0            0           0             0                  0               0              0            0
3                   0               0                 0                1            0            0           0             0                  0               0              0            0
4                   0               0                 0                0            1            0           0             0                  0               0              0            0
...               ...             ...               ...              ...          ...          ...         ...           ...                ...             ...            ...          ...
45206               0               1                 0                0            0            0           0             0                  0               0              0            0
45207               0               0                 0                0            0            1           0             0                  0               0              0            0
45208               0               0                 0                0            0            1           0             0                  0               0              0            0
45209               0               0                 0                1            0            0           0             0                  0               0              0            0
45210               0               0                 1                0            0            0           0             0                  0               0              0            0

45211 rows × 12 columns
```
The target column looks like this:
```
0        0
1        0
2        0
3        0
4        0
        ..
45206    1
45207    1
45208    1
45209    0
45210    0
Name: Target_yes, Length: 45211, dtype: int32
```
I fit this data with a sklearn logistic regression model and obtained the coefficients. Since I couldn't interpret them, I looked for alternatives and found the statsmodels version, and did the same thing with its logit function. In the example I saw online, the author applied sm.add_constant to the x variables.
```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

model = LogisticRegression(solver='liblinear')
model.fit(vari, tgt)    # vari: the 12 job dummy columns, tgt: Target_yes
model.score(vari, tgt)  # mean accuracy on the training data

# Collect the fitted coefficients and the intercept in one frame.
df = pd.DataFrame(model.coef_)
df['inter'] = model.intercept_
print(df)
```
The model score and the output of print(df) are:
```
0.8830151954170445    (model score)

print(df)
          0         1         2         3         4         5         6  \
0 -0.040404 -0.289274 -0.604957 -0.748797 -0.206201  0.573717 -0.177778

          7         8         9        10        11     inter
0 -0.530802 -0.210549  0.099326 -0.539109  0.879504 -1.795323
```
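Since the integer column labels 0-11 just follow the order of the dummy columns in `vari`, the coefficients can be paired with their job names directly; a small sketch (reusing the `model` and `vari` objects above):

```python
# Label each sklearn coefficient with the dummy column it belongs to and
# rank them: the most positive coefficients push predictions towards yes.
coefs = pd.Series(model.coef_[0], index=vari.columns)
print(coefs.sort_values(ascending=False))
```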
When I use sm.add_constant, I get coefficients similar to the sklearn logistic regression, but the z-scores, which I intended to use to find out which job type contributes most to a positive prediction, all become nan.
```python
import statsmodels.api as sm

logit = sm.Logit(tgt, sm.add_constant(vari)).fit()
logit.summary2()
```
The result is:
```
E:\Programs\Anaconda\lib\site-packages\numpy\core\fromnumeric.py:2495: FutureWarning:
Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
E:\Programs\Anaconda\lib\site-packages\statsmodels\base\model.py:1286: RuntimeWarning:
invalid value encountered in sqrt
E:\Programs\Anaconda\lib\site-packages\scipy\stats\_distn_infrastructure.py:901: RuntimeWarning:
invalid value encountered in greater
E:\Programs\Anaconda\lib\site-packages\scipy\stats\_distn_infrastructure.py:901: RuntimeWarning:
invalid value encountered in less
E:\Programs\Anaconda\lib\site-packages\scipy\stats\_distn_infrastructure.py:1892: RuntimeWarning:
invalid value encountered in less_equal

Optimization terminated successfully.
         Current function value: 0.352610
         Iterations 13

Model:               Logit             Pseudo R-squared:  0.023
Dependent Variable:  Target_yes        AIC:               31907.6785
Date:                2019-11-18 10:17  BIC:               32012.3076
No. Observations:    45211             Log-Likelihood:    -15942.
Df Model:            11                LL-Null:           -16315.
Df Residuals:        45199             LLR p-value:       3.9218e-153
Converged:           1.0000            Scale:             1.0000
No. Iterations:      13.0000

                     Coef.  Std.Err.      z  P>|z|  [0.025  0.975]
const              -1.7968       nan    nan    nan     nan     nan
job_management     -0.0390       nan    nan    nan     nan     nan
job_technician     -0.2882       nan    nan    nan     nan     nan
job_entrepreneur   -0.6092       nan    nan    nan     nan     nan
job_blue-collar    -0.7484       nan    nan    nan     nan     nan
job_unknown        -0.2142       nan    nan    nan     nan     nan
job_retired         0.5766       nan    nan    nan     nan     nan
job_admin.         -0.1766       nan    nan    nan     nan     nan
job_services       -0.5312       nan    nan    nan     nan     nan
job_self-employed  -0.2106       nan    nan    nan     nan     nan
job_unemployed      0.1011       nan    nan    nan     nan     nan
job_housemaid      -0.5427       nan    nan    nan     nan     nan
job_student         0.8857       nan    nan    nan     nan     nan
```
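A quick diagnostic for the nan standard errors and the "invalid value encountered in sqrt" warning; this is a sketch of the check, not from my original code. The 12 job dummies sum to 1 in every row (each customer has exactly one job), so together they duplicate the constant column that sm.add_constant appends, and the design matrix becomes perfectly collinear:

```python
import numpy as np

# Each row has exactly one dummy set, so the dummies already add up to a
# column of ones -- the same thing sm.add_constant appends.
print(vari.sum(axis=1).unique())  # expected: [1]

# With the extra constant the design matrix is rank-deficient:
X = sm.add_constant(vari)
print(np.linalg.matrix_rank(X.values), 'of', X.shape[1], 'columns independent')
```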
If I use the statsmodels logit directly without sm.add_constant, I get coefficients very different from the sklearn logistic regression, but the z-scores have values (all negative).
```python
import statsmodels.api as sm

logit = sm.Logit(tgt, vari).fit()
logit.summary2()
```
The result is:
```
Optimization terminated successfully.
         Current function value: 0.352610
         Iterations 6

Model:               Logit             Pseudo R-squared:  0.023
Dependent Variable:  Target_yes        AIC:               31907.6785
Date:                2019-11-18 10:18  BIC:               32012.3076
No. Observations:    45211             Log-Likelihood:    -15942.
Df Model:            11                LL-Null:           -16315.
Df Residuals:        45199             LLR p-value:       3.9218e-153
Converged:           1.0000            Scale:             1.0000
No. Iterations:      6.0000

                     Coef.  Std.Err.         z  P>|z|   [0.025   0.975]
job_management     -1.8357    0.0299  -61.4917  0.0000  -1.8943  -1.7772
job_technician     -2.0849    0.0366  -56.9885  0.0000  -2.1566  -2.0132
job_entrepreneur   -2.4060    0.0941  -25.5563  0.0000  -2.5905  -2.2215
job_blue-collar    -2.5452    0.0390  -65.2134  0.0000  -2.6217  -2.4687
job_unknown        -2.0110    0.1826  -11.0120  0.0000  -2.3689  -1.6531
job_retired        -1.2201    0.0501  -24.3534  0.0000  -1.3183  -1.1219
job_admin.         -1.9734    0.0425  -46.4478  0.0000  -2.0566  -1.8901
job_services       -2.3280    0.0545  -42.6871  0.0000  -2.4349  -2.2211
job_self-employed  -2.0074    0.0779  -25.7739  0.0000  -2.1600  -1.8547
job_unemployed     -1.6957    0.0765  -22.1538  0.0000  -1.8457  -1.5457
job_housemaid      -2.3395    0.1003  -23.3270  0.0000  -2.5361  -2.1429
job_student        -0.9111    0.0722  -12.6195  0.0000  -1.0526  -0.7696
```
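The two fits appear to be the same model in two parameterizations: without a constant, each dummy coefficient is the absolute log-odds of Target_yes for that category, while with the constant each coefficient is an offset from the intercept. A quick arithmetic check against the two summaries above (values copied from the outputs, a sketch only):

```python
# No-constant coefficient  =  intercept + with-constant offset
print(-1.7968 + (-0.0390))  # job_management: -1.8358, vs. -1.8357 above
print(-1.7968 + 0.8857)     # job_student:    -0.9111, vs. -0.9111 above
```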
Which of the two approaches is better? Or should I be using an entirely different method?
Answer:
> I fit this data with a sklearn logistic regression model and obtained the coefficients. Since I couldn't interpret them, I looked for alternatives and found the statsmodels version.
```
print(df)
          0         1         2         3         4         5         6  \
0 -0.040404 -0.289274 -0.604957 -0.748797 -0.206201  0.573717 -0.177778

          7         8         9        10        11     inter
0 -0.530802 -0.210549  0.099326 -0.539109  0.879504 -1.795323
```
The interpretation goes as follows: exponentiating a log-odds coefficient gives the odds ratio for a one-unit increase in that variable. For example, take the coefficient 0.573717 (column 5 above): for a customer in that job category, you can assert that the odds of the outcome Target_yes = 1 (the customer subscribed to a term deposit) are exp(0.573717) = 1.7748519304802 times the odds of Target_yes = 0 (the customer did not).
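To apply the same computation to all coefficients at once and rank the job categories, a hedged sketch reusing the `model` and `vari` names from the question:

```python
import numpy as np
import pandas as pd

# Odds ratios: exponentiate each fitted log-odds coefficient.
odds_ratios = pd.Series(np.exp(model.coef_[0]), index=vari.columns)

# The category with the largest odds ratio contributes most to a
# positive (Target_yes = 1) prediction.
print(odds_ratios.sort_values(ascending=False))
```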