Problem statement: predict the weight of a shipping parcel after a customer orders particular items (e.g. boots, sneakers, etc.).
My data frame therefore consists of historical data, where the product item categories (e.g. boots, sneakers, etc.) are the features and the weight is the 'y' variable I want to predict. Each row of the data frame contains the number of items of each product category that the customer ordered.
Example: a customer orders 1 pair of boots and 1 pair of sneakers. A row looks like this:
x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11 x12 x13 x14 x15 x16 x17 x18 x19 x20 x21 x22 x23 x24 x25 x26 x27 x28 x29 x30 x31 x32 x33 x34 x35 x36 x37 x38 x39 x40 x41 x42 x43 x44 x45 x46 x47 y
1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 2.94
(i.e. x1 = 1, x13 = 1, x47 = 2, all other features 0, y = 2.94)
One of the features is items_total, here x47 (how many items the customer ordered in total).
I created a linear model with the following code:

from sklearn import linear_model

regr_model = linear_model.LinearRegression()

After splitting the data frame into training and test sets, I run the model with regr_model.fit(x_train, y_train).
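End to end, a setup like the one described so far can be sketched on synthetic data (the row counts, the value ranges, and the construction of items_total below are all assumptions for illustration, not the asker's actual data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

n_rows, n_items = 1000, 46                          # 46 item-category count columns
X_items = rng.integers(0, 3, size=(n_rows, n_items)).astype(float)
items_total = X_items.sum(axis=1, keepdims=True)    # x47: total number of items ordered
X = np.hstack([X_items, items_total])               # 47 features in total

true_weights = rng.uniform(0.1, 2.5, size=n_items)  # hypothetical per-category weights
y = X_items @ true_weights + rng.normal(0, 0.1, size=n_rows)

x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=0)

regr_model = LinearRegression()
regr_model.fit(x_train, y_train)

print("R^2 on test set:", regr_model.score(x_test, y_test))
print("number of coefficients:", regr_model.coef_.shape[0])
```

Note that items_total is by construction an exact linear combination of the other 46 columns, which matters for what follows.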
When I look at the coefficients, I get the following output (formatted for readability):
1 feature x1 6494532107.689080 (this is the items_total feature)
2 feature x2 -6494532105.548431
3 feature x3 -6494532105.956598
4 feature x4 -6494532105.987348
5 feature x5 -6494532106.081478
6 feature x6 -6494532106.139558
7 feature x7 -6494532106.163167
8 feature x8 -6494532106.326231
9 feature x9 -6494532106.360985
10 feature x10 -6494532106.507434
11 feature x11 -6494532106.678183
12 feature x12 -6494532106.711108
13 feature x13 -6494532106.906321
14 feature x14 -6494532106.916800
15 feature x15 -6494532106.941691
16 feature x16 -6494532107.049221
17 feature x17 -6494532107.071664
18 feature x18 -6494532107.076819
19 feature x19 -6494532107.095350
20 feature x20 -6494532107.124458
21 feature x21 -6494532107.208526
22 feature x22 -6494532107.291896
23 feature x23 -6494532107.315606
24 feature x24 -6494532107.319578
25 feature x25 -6494532107.322818
26 feature x26 -6494532107.337678
27 feature x27 -6494532107.345344
28 feature x28 -6494532107.347136
29 feature x29 -6494532107.374278
30 feature x30 -6494532107.403748
31 feature x31 -6494532107.405770
32 feature x32 -6494532107.411852
33 feature x33 -6494532107.469144
34 feature x34 -6494532107.470899
35 feature x35 -6494532107.471970
36 feature x36 -6494532107.489899
37 feature x37 -6494532107.495930
38 feature x38 -6494532107.504712
39 feature x39 -6494532107.522346
40 feature x40 -6494532107.557917
41 feature x41 -6494532107.561793
42 feature x42 -6494532107.562286
43 feature x43 -6494532107.601017
44 feature x44 -6494532107.603461
45 feature x45 -6494532107.686674
46 feature x46 -6494532107.843128
47 feature x47 -6494532107.910987
The intercept is: 0.555702083558. The model score is: 0.79.
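For reference, a per-feature listing like the one above can be produced by pairing coef_ with the column names; a minimal sketch on a tiny made-up data set (the feature names and values here are hypothetical):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny example just to show how to print one coefficient per feature.
feature_names = ["x1", "x2", "x3"]
X = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 1.0],
              [2.0, 1.0, 0.0],
              [1.0, 1.0, 1.0]])
y = np.array([3.0, 1.5, 2.0, 2.5])

model = LinearRegression().fit(X, y)
for i, (name, coef) in enumerate(zip(feature_names, model.coef_), start=1):
    print(f"{i} feature {name} {coef:.6f}")
print("intercept:", model.intercept_, "score:", model.score(X, y))
```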
When I remove items_total, I get much more reasonable coefficients:
1 feature x2 2.140582
2 feature x3 1.732328
3 feature x4 1.701661
4 feature x5 1.607465
5 feature x6 1.549196
6 feature x7 1.526227
7 feature x8 1.363067
8 feature x9 1.329225
9 feature x10 1.18109
10 feature x11 1.010639
11 feature x12 0.978123
12 feature x13 0.782569
13 feature x14 0.773164
14 feature x15 0.747479
15 feature x16 0.638743
16 feature x17 0.617082
17 feature x18 0.61257
18 feature x19 0.593665
19 feature x20 0.565309
20 feature x21 0.480105
21 feature x22 0.396592
22 feature x23 0.373675
23 feature x24 0.369643
24 feature x25 0.365989
25 feature x26 0.350971
26 feature x27 0.343381
27 feature x28 0.34158
28 feature x29 0.314405
29 feature x30 0.285344
30 feature x31 0.282827
31 feature x32 0.277007
32 feature x33 0.219727
33 feature x34 0.217814
34 feature x35 0.217466
35 feature x36 0.198526
36 feature x37 0.193277
37 feature x38 0.184332
38 feature x39 0.166745
39 feature x40 0.130655
40 feature x41 0.127573
41 feature x42 0.126665
42 feature x43 0.087371
43 feature x44 0.085545
44 feature x45 0.003045
45 feature x46 -0.153778
46 feature x47 -0.221548
The model's intercept and score are the same as before. Can someone help me understand why the coefficients are so different after the items_total column is removed?
Answer:
I think this is mostly a theoretical question, which would be better asked on https://stats.stackexchange.com/ or https://datascience.stackexchange.com/.
What you are seeing is called multicollinearity.
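You can check your own design matrix for this directly, since an exact linear dependence is detectable numerically; a small sketch on a toy matrix (the layout, where the last column is the row total, mirrors the items_total setup):

```python
import numpy as np

# Toy design matrix: three item-count columns plus a total column (x4 = x1 + x2 + x3).
X = np.array([
    [1.0, 0.0, 1.0, 2.0],
    [0.0, 2.0, 0.0, 2.0],
    [1.0, 1.0, 1.0, 3.0],
    [2.0, 0.0, 0.0, 2.0],
])

# The last column equals the row sum of the others -> exact collinearity.
print(np.allclose(X[:, -1], X[:, :-1].sum(axis=1)))   # True

# Rank deficiency is another symptom: fewer independent columns than features.
print(np.linalg.matrix_rank(X) < X.shape[1])          # True
```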
Let me give a simpler example that demonstrates the problem (a version of it can be found on the Russian Wikipedia page on multicollinearity). Suppose you have three features x1, x2, x3, where x1 = x2 + x3, and a model that looks like this:

y = w1*x1 + w2*x2 + w3*x3

Because x1 = x2 + x3, for any constant a the weights (w1 + a, w2 - a, w3 - a) produce exactly the same predictions: the extra terms a*x1 - a*x2 - a*x3 = a*(x1 - x2 - x3) cancel to zero. So after arbitrarily shifting the coefficients we still get the same model, and that is exactly the problem: the individual coefficients are no longer uniquely determined, and the solver can return huge offsetting values. You should therefore avoid such strong linear relationships between features (your last feature is exactly this kind of combination of all the others).
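The cancellation argument above can be verified numerically; a minimal sketch, where the weights and the shift a are arbitrary choices (a is made large to mimic the ~6.5e9 coefficients in the question):

```python
import numpy as np

rng = np.random.default_rng(0)
x2 = rng.random(5)
x3 = rng.random(5)
x1 = x2 + x3                      # exact linear dependence, as in the example

w1, w2, w3 = 1.5, -0.7, 2.0       # one arbitrary set of weights
a = 1e6                           # an arbitrary large shift

pred_original = w1 * x1 + w2 * x2 + w3 * x3
pred_shifted = (w1 + a) * x1 + (w2 - a) * x2 + (w3 - a) * x3

# Both weight vectors define the same model: identical predictions.
print(np.allclose(pred_original, pred_shifted))   # True
```

In practice, dropping the redundant column (as you already did) or using a regularized estimator such as sklearn's Ridge keeps the coefficients finite and interpretable.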