Using logistic regression in Python to see which variable contributes most to a positive prediction

I have a bank dataset and need to predict whether a customer will subscribe to a term deposit. One column, "job", is a categorical variable holding each customer's occupation. I am currently doing exploratory data analysis (EDA) and want to find out which job category contributes most to a positive prediction.

I plan to use logistic regression for this (not sure it is the most appropriate method; suggestions for alternatives are welcome).

Here are the steps I took:

I one-hot encoded the job column (one 1/0 dummy column per job category, k columns in total), and encoded the target as a single 1/0 column, Target_yes (1 means the customer subscribed to the term deposit, 0 means they did not).

           job_management  job_technician  job_entrepreneur  job_blue-collar  job_unknown  job_retired  job_admin.  job_services  job_self-employed  job_unemployed  job_housemaid  job_student
    0                   1               0                 0                0            0            0           0             0                  0               0              0            0
    1                   0               1                 0                0            0            0           0             0                  0               0              0            0
    2                   0               0                 1                0            0            0           0             0                  0               0              0            0
    3                   0               0                 0                1            0            0           0             0                  0               0              0            0
    4                   0               0                 0                0            1            0           0             0                  0               0              0            0
    ...               ...             ...               ...              ...          ...          ...         ...           ...                ...             ...            ...          ...
    45206               0               1                 0                0            0            0           0             0                  0               0              0            0
    45207               0               0                 0                0            0            1           0             0                  0               0              0            0
    45208               0               0                 0                0            0            1           0             0                  0               0              0            0
    45209               0               0                 0                1            0            0           0             0                  0               0              0            0
    45210               0               0                 1                0            0            0           0             0                  0               0              0            0

    45211 rows × 12 columns

The target column looks like this:

    0        0
    1        0
    2        0
    3        0
    4        0
            ..
    45206    1
    45207    1
    45208    1
    45209    0
    45210    0
    Name: Target_yes, Length: 45211, dtype: int32
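The encoding above can be reproduced with pandas' get_dummies. A minimal sketch on a toy frame standing in for the bank data (the raw column names "job" and "y" here are assumptions):

```python
import pandas as pd

# Toy stand-in for the bank data; the raw column names are assumptions.
df = pd.DataFrame({
    "job": ["management", "technician", "retired", "student"],
    "y":   ["no", "yes", "yes", "no"],
})

# One 1/0 dummy column per job category (matches the k-column table above).
vari = pd.get_dummies(df["job"], prefix="job")

# Single 1/0 target column, as in Target_yes.
tgt = (df["y"] == "yes").astype(int).rename("Target_yes")

print(vari.columns.tolist())
print(tgt.tolist())
```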

I fit this data with a sklearn logistic regression model and obtained the coefficients. Since I could not interpret them, I looked for alternatives and found the statsmodels version. I did the same thing with its logit function. In an example I saw online, the author applied sm.add_constant to the x variables.

    from sklearn.linear_model import LogisticRegression
    from sklearn import metrics

    model = LogisticRegression(solver='liblinear')
    model.fit(vari, tgt)
    model.score(vari, tgt)

    df = pd.DataFrame(model.coef_)
    df['inter'] = model.intercept_
    print(df)

The model's score and the output of print(df) are as follows:

    0.8830151954170445  (model score)

    print(df)
              0         1         2         3         4         5         6  \
    0 -0.040404 -0.289274 -0.604957 -0.748797 -0.206201  0.573717 -0.177778

              7         8         9        10        11     inter
    0 -0.530802 -0.210549  0.099326 -0.539109  0.879504 -1.795323
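The numeric column labels 0-11 are hard to read on their own; the coefficients can be paired with the dummy-column names and sorted. A minimal sketch, assuming the columns of vari appear in the order shown in the table above:

```python
import pandas as pd

# Coefficients as printed by sklearn above, paired with the dummy-column
# names (the pairing assumes the column order shown in the earlier table).
cols = ["job_management", "job_technician", "job_entrepreneur",
        "job_blue-collar", "job_unknown", "job_retired", "job_admin.",
        "job_services", "job_self-employed", "job_unemployed",
        "job_housemaid", "job_student"]
coefs = [-0.040404, -0.289274, -0.604957, -0.748797, -0.206201, 0.573717,
         -0.177778, -0.530802, -0.210549, 0.099326, -0.539109, 0.879504]

# Largest (most positive) coefficients first.
labelled = pd.Series(coefs, index=cols).sort_values(ascending=False)
print(labelled.head(3))
```

Sorted this way, job_student and job_retired come out with the largest positive coefficients.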

When I used sm.add_constant, I got coefficients similar to sklearn's logistic regression, but the z-scores, which I intended to use to find which job type contributes most to a positive prediction, all became nan.

    import statsmodels.api as sm

    logit = sm.Logit(tgt, sm.add_constant(vari)).fit()
    logit.summary2()

The result is:

    E:\Programs\Anaconda\lib\site-packages\numpy\core\fromnumeric.py:2495: FutureWarning:
    Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
    E:\Programs\Anaconda\lib\site-packages\statsmodels\base\model.py:1286: RuntimeWarning:
    invalid value encountered in sqrt
    E:\Programs\Anaconda\lib\site-packages\scipy\stats\_distn_infrastructure.py:901: RuntimeWarning:
    invalid value encountered in greater
    E:\Programs\Anaconda\lib\site-packages\scipy\stats\_distn_infrastructure.py:901: RuntimeWarning:
    invalid value encountered in less
    E:\Programs\Anaconda\lib\site-packages\scipy\stats\_distn_infrastructure.py:1892: RuntimeWarning:
    invalid value encountered in less_equal
    Optimization terminated successfully.
             Current function value: 0.352610
             Iterations 13

    Model:               Logit             Pseudo R-squared:  0.023
    Dependent Variable:  Target_yes        AIC:               31907.6785
    Date:                2019-11-18 10:17  BIC:               32012.3076
    No. Observations:    45211             Log-Likelihood:    -15942.
    Df Model:            11                LL-Null:           -16315.
    Df Residuals:        45199             LLR p-value:       3.9218e-153
    Converged:           1.0000            Scale:             1.0000
    No. Iterations:      13.0000

                         Coef.   Std.Err.    z    P>|z|  [0.025  0.975]
    const              -1.7968      nan     nan    nan     nan     nan
    job_management     -0.0390      nan     nan    nan     nan     nan
    job_technician     -0.2882      nan     nan    nan     nan     nan
    job_entrepreneur   -0.6092      nan     nan    nan     nan     nan
    job_blue-collar    -0.7484      nan     nan    nan     nan     nan
    job_unknown        -0.2142      nan     nan    nan     nan     nan
    job_retired         0.5766      nan     nan    nan     nan     nan
    job_admin.         -0.1766      nan     nan    nan     nan     nan
    job_services       -0.5312      nan     nan    nan     nan     nan
    job_self-employed  -0.2106      nan     nan    nan     nan     nan
    job_unemployed      0.1011      nan     nan    nan     nan     nan
    job_housemaid      -0.5427      nan     nan    nan     nan     nan
    job_student         0.8857      nan     nan    nan     nan     nan

If I skip sm.add_constant and use statsmodels' logit directly, I get coefficients very different from sklearn's logistic regression, but the z-scores have values (all of them negative).

    import statsmodels.api as sm

    logit = sm.Logit(tgt, vari).fit()
    logit.summary2()

The result is:

    Optimization terminated successfully.
             Current function value: 0.352610
             Iterations 6

    Model:               Logit             Pseudo R-squared:  0.023
    Dependent Variable:  Target_yes        AIC:               31907.6785
    Date:                2019-11-18 10:18  BIC:               32012.3076
    No. Observations:    45211             Log-Likelihood:    -15942.
    Df Model:            11                LL-Null:           -16315.
    Df Residuals:        45199             LLR p-value:       3.9218e-153
    Converged:           1.0000            Scale:             1.0000
    No. Iterations:      6.0000

                         Coef.   Std.Err.      z       P>|z|   [0.025   0.975]
    job_management     -1.8357    0.0299   -61.4917   0.0000  -1.8943  -1.7772
    job_technician     -2.0849    0.0366   -56.9885   0.0000  -2.1566  -2.0132
    job_entrepreneur   -2.4060    0.0941   -25.5563   0.0000  -2.5905  -2.2215
    job_blue-collar    -2.5452    0.0390   -65.2134   0.0000  -2.6217  -2.4687
    job_unknown        -2.0110    0.1826   -11.0120   0.0000  -2.3689  -1.6531
    job_retired        -1.2201    0.0501   -24.3534   0.0000  -1.3183  -1.1219
    job_admin.         -1.9734    0.0425   -46.4478   0.0000  -2.0566  -1.8901
    job_services       -2.3280    0.0545   -42.6871   0.0000  -2.4349  -2.2211
    job_self-employed  -2.0074    0.0779   -25.7739   0.0000  -2.1600  -1.8547
    job_unemployed     -1.6957    0.0765   -22.1538   0.0000  -1.8457  -1.5457
    job_housemaid      -2.3395    0.1003   -23.3270   0.0000  -2.5361  -2.1429
    job_student        -0.9111    0.0722   -12.6195   0.0000  -1.0526  -0.7696

Which of these two approaches is better? Or should I use a completely different method?


Answer:

I fit this data with a sklearn logistic regression model and obtained the coefficients. Since I could not interpret them, I looked for alternatives and found the statsmodels version.

    print(df)
              0         1         2         3         4         5         6  \
    0 -0.040404 -0.289274 -0.604957 -0.748797 -0.206201  0.573717 -0.177778

              7         8         9        10        11     inter
    0 -0.530802 -0.210549  0.099326 -0.539109  0.879504 -1.795323

Here is how to interpret them: exponentiating a log-odds coefficient gives the odds ratio for a one-unit increase in that variable. For example, with Target_yes (1 means the customer subscribed to the term deposit, 0 means they did not) and a logistic regression coefficient of 0.573717, you can assert that the odds of a "yes" outcome are exp(0.573717) = 1.7748519304802 times the odds of a "no" outcome.
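The odds-ratio calculation above is a one-liner; here 0.573717 is taken from the sklearn coefficients printed earlier:

```python
import math

# Log-odds coefficient from the sklearn fit above (column 5, job_retired).
coef = 0.573717

# Exponentiating a log-odds coefficient gives the odds ratio.
odds_ratio = math.exp(coef)
print(odds_ratio)  # ~1.7749: the odds of "yes" are about 1.77x the reference odds
```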

