I have just started learning Python, having used R until now. When I build a simple regression model in R, I get very different results from the same operation in IPython.
The R-squared, the p-values, the significance of the coefficients — none of them match. Am I misreading the output, or am I making some other basic mistake?
Here are my code and results in R and Python:
R code
str(df_nv)
Classes 'tbl_df', 'tbl' and 'data.frame': 81 obs. of 2 variables:
 $ Dependent Variable  : num 733 627 405 353 434 556 381 558 612 901 ...
 $ Independent Variable: num 0.193 0.167 0.169 0.14 0.145 ...

summary(lm(`Dependent Variable` ~ `Independent Variable`, data = df_nv))

Call:
lm(formula = `Dependent Variable` ~ `Independent Variable`, data = df_nv)

Residuals:
    Min      1Q  Median      3Q     Max 
-501.18 -139.20  -82.61  -15.82 2136.74 

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)   
(Intercept)               478.2      148.2   3.226  0.00183 **
`Independent Variable`   -196.1     1076.9  -0.182  0.85601   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 381.5 on 79 degrees of freedom
Multiple R-squared:  0.0004194, Adjusted R-squared:  -0.01223 
F-statistic: 0.03314 on 1 and 79 DF,  p-value: 0.856
IPython Notebook code
df_nv.dtypes
Dependent Variable      float64
Independent Variable    float64
dtype: object

model = sm.OLS(df_nv['Dependent Variable'], df_nv['Independent Variable'])
results = model.fit()
results.summary()

                       OLS Regression Results
Dep. Variable:    Dependent Variable   R-squared:           0.537
Model:            OLS                  Adj. R-squared:      0.531
Method:           Least Squares        F-statistic:         92.63
Date:             Fri, 20 Jan 2017     Prob (F-statistic):  5.23e-15
Time:             09:08:54             Log-Likelihood:      -600.40
No. Observations: 81                   AIC:                 1203.
Df Residuals:     80                   BIC:                 1205.
Df Model:         1
Covariance Type:  nonrobust

                        coef       std err   t       P>|t|   [95.0% Conf. Int.]
Independent Variable    3133.1825  325.537   9.625   0.000   2485.342  3781.023

Omnibus:        89.595   Durbin-Watson:     1.940
Prob(Omnibus):  0.000    Jarque-Bera (JB):  980.289
Skew:           3.489    Prob(JB):          1.36e-213
Kurtosis:       18.549   Cond. No.          1.00
For reference, the first few rows of the data frame in R and in Python:
R:
head(df_nv)
  Dependent Variable Independent Variable
               <dbl>                <dbl>
1                733            0.1932367
2                627            0.1666667
3                405            0.1686183
4                353            0.1398601
5                434            0.1449275
6                556            0.1475410
Python:
df_nv.head()
      Dependent Variable  Independent Variable
5292               733.0              0.193237
5320               627.0              0.166667
5348               405.0              0.168618
5404               353.0              0.139860
5460               434.0              0.144928
Answer:
Below are the results of a linear regression on the gapminder dataset, run in Python with pandas (using statsmodels.formula.api) and in R; they are exactly the same:
R code
df <- read.csv('gapminder.csv')
df <- df[c('internetuserate', 'urbanrate')]
df <- df[complete.cases(df),]
dim(df)
# [1] 190   2
m <- lm(internetuserate~urbanrate, df)
summary(m)
#Call:
#lm(formula = internetuserate ~ urbanrate, data = df)
#
#Residuals:
#    Min      1Q  Median      3Q     Max 
#-51.474 -15.857  -3.954  14.305  74.590 
#
#Coefficients:
#            Estimate Std. Error t value Pr(>|t|)    
#(Intercept) -4.90375    4.11485  -1.192    0.235    
#urbanrate    0.72022    0.06753  10.665   <2e-16 ***
#---
#Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
#Residual standard error: 22.03 on 188 degrees of freedom
#Multiple R-squared:  0.3769, Adjusted R-squared:  0.3736 
#F-statistic: 113.7 on 1 and 188 DF,  p-value: < 2.2e-16