I built a model that predicts house prices from several features.
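
    import statsmodels.api as statsmdl
    from sklearn import datasets

    # `data` is the DataFrame that holds the housing data
    X = data[['NumberofRooms', 'YearBuilt', 'Type', 'NewConstruction']]
    y = data["Price"]

    # Fit an ordinary least squares model and predict on the training features
    model = statsmdl.OLS(y, X).fit()
    predictions = model.predict(X)
    model.summary()
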
How can I find out which of these features are collinear?
Answer:
You can use the DataFrame.corr() method.
Example:

    In [27]: df = pd.DataFrame(np.random.randint(10, size=(5,3)), columns=list('abc'))

    In [28]: df['d'] = df['a'] * 10 - df['b'] / np.pi

    In [29]: df['e'] = np.log(df['c'] **2)

    In [30]: c = df.corr()

    In [31]: c
    Out[31]:
              a         b         c         d         e
    a  1.000000  0.734858  0.113787  0.999837  0.067358
    b  0.734858  1.000000 -0.523635  0.722485 -0.598739
    c  0.113787 -0.523635  1.000000  0.129945  0.984257
    d  0.999837  0.722485  0.129945  1.000000  0.084615
    e  0.067358 -0.598739  0.984257  0.084615  1.000000

    In [32]: c[c >= 0.7]
    Out[32]:
              a         b         c         d         e
    a  1.000000  0.734858       NaN  0.999837       NaN
    b  0.734858  1.000000       NaN  0.722485       NaN
    c       NaN       NaN  1.000000       NaN  0.984257
    d  0.999837  0.722485       NaN  1.000000       NaN
    e       NaN       NaN  0.984257       NaN  1.000000

    In [33]: c[c >= 0.7].stack().reset_index(name='cor').query("abs(cor) < 1.0")
    Out[33]:
       level_0 level_1       cor
    1        a       b  0.734858
    2        a       d  0.999837
    3        b       a  0.734858
    5        b       d  0.722485
    7        c       e  0.984257
    8        d       a  0.999837
    9        d       b  0.722485
    11       e       c  0.984257
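
The filter `c >= 0.7` in the session above keeps only positive correlations and lists every pair twice (`a, b` and `b, a`). As a rough sketch of one way to tighten this, the helper below takes absolute values and keeps only the upper triangle of the matrix. The function name `correlated_pairs`, the column labels `feature_1`, `feature_2`, `abs_cor`, and the `demo` data are made up for illustration and are not part of the original answer. It assumes every column is numeric, so a categorical column such as `Type` would need to be encoded first.

```python
import numpy as np
import pandas as pd


def correlated_pairs(df, threshold=0.7):
    """Return unique column pairs whose absolute Pearson correlation is >= threshold.

    Hypothetical helper for illustration; assumes all columns of df are numeric.
    """
    corr = df.corr().abs()
    # Strict upper triangle: each pair appears once and the diagonal is excluded.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().reset_index()
    pairs.columns = ["feature_1", "feature_2", "abs_cor"]
    # The threshold filter also discards any NaN cells left over from the mask.
    strong = pairs[pairs["abs_cor"] >= threshold]
    return strong.sort_values("abs_cor", ascending=False).reset_index(drop=True)


# Small made-up data set: d is an exact negative linear function of a.
demo = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [5, 3, 4, 1, 2],
    "c": [3, 1, 4, 1, 5],
})
demo["d"] = -2 * demo["a"] + 1

print(correlated_pairs(demo))
```

In this demo the pair `a`, `d` is perfectly negatively correlated, so a positive-only filter would miss it. Pairwise correlation also cannot detect a column that is a linear combination of several others, so it is best treated as a first check.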