I am trying to understand how multiple linear regression works in machine learning. My problem is that I don't know how to set up the regression line correctly, or whether my coefficients are right.
So I thought I could split this into three questions:
- Is my way of finding the regression-line coefficients correct?
- Is my way of setting up the regression line correct?
- Is my way of plotting it correct?
My code (Python 3.8.5) is as follows:
```python
from scipy import stats as stats
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

dataset = pd.read_csv("cars.csv")
df = dataset.fillna(dataset.mean().round(1))
x_cars = df[['Weight', 'Volume']]
y_cars = df['CO2']
x_cars_weight = x_cars.Weight
x_cars_volume = x_cars.Volume

# Best-fit line, multiple variables
X = [x_cars_weight, x_cars_volume]
A = np.column_stack([np.ones(len(x_cars_volume))] + X)
Y = y_cars
coeffs_multi_reversed, _, _, _ = np.linalg.lstsq(A, Y, rcond=None)
coeffs_multi = coeffs_multi_reversed[::-1]

# Plotting
from mpl_toolkits import mplot3d
fig = plt.figure()
ax = plt.axes(projection='3d')
z = y_cars
x = x_cars_weight
y = x_cars_volume
c = x + y
ax.scatter(x, y, z, c=c)
ax.set_title('$CO_2$ emission')

x1 = coeffs_multi[2]*np.linspace(0, 120)
y1 = coeffs_multi[1]*np.linspace(0, 120)
z1 = x1 + y1 + coeffs_multi[0]
ax.plot3D(x1, y1, z1, 'gray')

ax.set_xlabel('x - Weight')
ax.set_ylabel('y - Volume')
ax.set_zlabel('z - $CO_2$')
```
My data file (cars.csv):
```
Car,Model,Volume,Weight,CO2
Toyoty,Aygo,1000,790,99
Mitsubishi,Space Star,1200,1160,95
Skoda,Citigo,1000,929,95
Fiat,500,900,865,90
Mini,Cooper,1500,1140,105
VW,Up!,1000,929,105
Skoda,Fabia,1400,1109,90
Mercedes,A-Class,1500,1365,92
Ford,Fiesta,1500,1112,98
Audi,A1,1600,1150,99
Hyundai,I20,1100,980,99
Suzuki,Swift,1300,990,101
Ford,Fiesta,1000,1112,99
Honda,Civic,1600,1252,94
Hundai,I30,1600,1326,97
Opel,Astra,1600,1330,97
BMW,1,1600,1365,99
Mazda,3,2200,1280,104
Skoda,Rapid,1600,1119,104
Ford,Focus,2000,1328,105
Ford,Mondeo,1600,1584,94
Opel,Insignia,2000,1428,99
Mercedes,C-Class,2100,1365,99
Skoda,Octavia,1600,1415,99
Volvo,S60,2000,1415,99
Mercedes,CLA,1500,1465,102
Audi,A4,2000,1490,104
Audi,A6,2000,1725,114
Volvo,V70,1600,1523,109
BMW,5,2000,1705,114
Mercedes,E-Class,2100,1605,115
Volvo,XC70,2000,1746,117
Ford,B-Max,1600,1235,104
BMW,216,1600,1390,108
Opel,Zafira,1600,1405,109
Mercedes,SLK,2500,1395,120
```
Answer:
Taking your questions in order:
- Your method looks correct, but it is a bit long-winded; see the more concise alternative below.
- I am not sure exactly what you mean, but I think this part:
```python
x1 = coeffs_multi[2]*np.linspace(0, 120)
y1 = coeffs_multi[1]*np.linspace(0, 120)
z1 = x1 + y1 + coeffs_multi[0]
```
is not quite right. The order of the coefficients in `coeffs_multi_reversed` is set by `X` (with the column of ones prepended in `A`), i.e. 'constant', 'Weight', 'Volume'. In `coeffs_multi` that order becomes 'Volume', 'Weight', 'constant', so the indexing above mixes the coefficients up (there is a short sanity-check sketch after the sklearn output below).
- For the plotting, I would not use `x1`, `y1` etc., but simply plot the model's actual vs. predicted values, like this:
```python
...
predicted = np.array(A) @ coeffs_multi_reversed
ax.scatter(x, y, z, label='actual')
ax.scatter(x, y, predicted, label='predicted')
...
```
- A more standard way of doing the regression is the following:
```python
from sklearn.linear_model import LinearRegression

lin_regr = LinearRegression()
lin_res = lin_regr.fit(x_cars, y_cars)
predicted = lin_regr.predict(x_cars)
print(lin_res.coef_, lin_res.intercept_)

plt.plot(predicted, y_cars, '.', label='actual vs predicted')
plt.plot(predicted, predicted, '.', label='predicted vs predicted')
plt.legend(loc='best')
plt.show()
```
The output is
```
[0.00755095 0.00780526] 79.69471929115937
```
and produces a plot of the actual vs. predicted values.
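To connect this output back to the coefficient-order point in the second bullet above: since `x_cars = df[['Weight', 'Volume']]`, `lin_res.coef_` is ordered `[Weight, Volume]`, so the fitted model reads roughly CO2 ≈ 79.69 + 0.00755 * Weight + 0.00781 * Volume. The same numbers should come out of the hand-rolled `np.linalg.lstsq` fit, whose columns in `A` are [constant, Weight, Volume]. A minimal sanity-check sketch, assuming the variables from both snippets above are still in scope (the example weight/volume values at the end are made up for illustration):

```python
# Sanity check (assumes the session above, where A's columns are
# [constant, Weight, Volume] and x_cars' columns are ['Weight', 'Volume']):
# the sklearn fit and the lstsq fit should agree up to numerical noise.
intercept, b_weight, b_volume = coeffs_multi_reversed
print(np.allclose([b_weight, b_volume], lin_res.coef_))  # expect True
print(np.isclose(intercept, lin_res.intercept_))          # expect True

# Predicted CO2 for one hypothetical car (weight 1200, volume 1600 -- illustrative values only)
print(intercept + b_weight * 1200 + b_volume * 1600)
```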
EDIT: plotting a 3D grid
To plot the predicted output on a mesh grid, you can do the following:
```python
npts = 20
from mpl_toolkits import mplot3d

fig = plt.figure()
ax = plt.axes(projection='3d')
x = x_cars['Weight']
y = x_cars['Volume']
z = y_cars  # actual CO2 values, as defined earlier
ax.scatter(x, y, z, label='actual')

# Evaluate the fitted model on a regular Weight x Volume grid
x1 = np.linspace(x.min(), x.max(), npts)
y1 = np.linspace(y.min(), y.max(), npts)
x1m, y1m = np.meshgrid(x1, y1)
z1 = lin_regr.predict(np.hstack([x1m.reshape(-1, 1), y1m.reshape(-1, 1)]))
ax.scatter(x1m.reshape(-1, 1), y1m.reshape(-1, 1), z1, s=1, label='predicted')

ax.set_xlabel('x - Weight')
ax.set_ylabel('y - Volume')
ax.set_zlabel('z - $CO_2$')
ax.set_title('$CO_2$ emission')
plt.legend(loc='best')
plt.show()
```
which gives a 3D plot of the actual data points together with the predicted values on the Weight-Volume grid.
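As an optional variant on the rendering (not part of the original answer): instead of scattering the grid points, the fitted plane can be drawn as a single translucent surface. The sketch below reuses `x1m`, `y1m`, `z1` and `ax` from the block above, replacing the `ax.scatter(..., label='predicted')` line before `plt.show()`:

```python
# Optional variant (assumes x1m, y1m, z1 and ax from the block above):
# reshape the flat predictions back onto the grid and draw them as a
# translucent surface instead of individual points.
ax.plot_surface(x1m, y1m, z1.reshape(x1m.shape), alpha=0.3, color='gray')
```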