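SGDRegressor() consistently fails to improve validation performance

After training on roughly 20,000 records, my SGDRegressor model's score on the validation (test) set stops moving in either direction. The "stuck" test score does not change even when I switch penalty, toggle early_stopping (True/False), or set alpha and eta0 to extremely high or low values.

Before training I applied StandardScaler and shuffled the training and test sets.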

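    from sklearn.model_selection import train_test_split

    # Assign the four splits; the names below are used in the rest of the post
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=85, shuffle=True)
    print(X_train.shape, X_test.shape)
    print(y_train.shape, y_test.shape)
    >>> (336144, 10) (144063, 10)
    >>> (336144,) (144063,)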

Is something wrong with my validation code, or can this behavior be explained by a limitation in how SGDRegressor handles training data?

    from sklearn.linear_model import SGDRegressor
    from sklearn.metrics import mean_squared_error
    import pandas
    import matplotlib.pyplot as plt

    scores_test = []
    scores_train = []
    my_rng = range(10, len(X_train), 30000)
    for m in my_rng:
        print(m)
        # A fresh model per iteration, so every fit starts from scratch
        modelSGD = SGDRegressor(alpha=0.00001, penalty='l1')
        modelSGD.fit(X_train[:m], y_train[:m])

        # Train error on the subset used for fitting, test error on the full test set
        ypred_train = modelSGD.predict(X_train[:m])
        ypred_test = modelSGD.predict(X_test)
        mse_train = mean_squared_error(y_train[:m], ypred_train)
        mse_test = mean_squared_error(y_test, ypred_test)
        scores_train.append(mse_train)
        scores_test.append(mse_test)

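How can I "force" SGDRegressor to take more training data into account so that its performance on the test data changes?

Edit: I am trying to visualize that the model's score on test does not change whether it was trained on 30,000 or 300,000 records. That is why I initialize SGDRegressor inside the loop, so that it is retrained from scratch in every iteration.

As @Nikaido asked, these are the model's coef_ and intercept_ after fitting: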

    trainsize: 10, coef: [ 0.81815135 2.2966633 1.61231584 -0.00339933 -3.03094922 0.12757874 -2.60874563 1.52383531 0.3250487 -0.61251297], intercept: [50.77553038]
    trainsize: 30010, coef: [ 0.19097587 -0.35854903 -0.16142221 0.11281925 -0.66771756 0.55912533 0.90462141 -1.417289 0.50487032 -1.42423654], intercept: [83.28458307]
    trainsize: 60010, coef: [ 0.09848169 -0.1362008 -0.15825232 -0.4401373 0.31664536 0.04960247 -0.37299047 0.6641436 0.02782047 -1.15355052], intercept: [80.87163096]
    trainsize: 90010, coef: [-0.00923631 0.5845441 0.28485334 -0.29528061 -0.30643056 1.20320208 1.9723999 -0.47707621 1.25355186 -2.04990825], intercept: [85.17812028]
    trainsize: 120010, coef: [-0.04959943 -0.15744169 -0.17071373 -0.20829149 -1.38683906 2.18572481 1.43380752 -1.48133799 2.18962484 -3.41135224], intercept: [86.40188522]
    trainsize: 150010, coef: [ 0.56190926 0.05052168 0.22624504 0.55751301 -0.50829818 1.27571154 1.49847285 -0.15134682 1.30017967 -0.88259823], intercept: [83.69264344]
    trainsize: 180010, coef: [ 0.17765624 0.1137466 0.15081498 -0.51520765 -1.00811419 -0.13203398 1.28565565 -0.03594421 -0.08053252 -2.31793746], intercept: [85.21824705]
    trainsize: 210010, coef: [-0.53937513 -0.33872786 -0.44854466 0.70039384 -0.77073389 0.4361326 0.88175392 -0.32460908 0.5141777 -1.5123801 ], intercept: [82.75353293]
    trainsize: 240010, coef: [ 0.70748011 -0.08992019 0.25365326 0.61999278 -0.29374005 0.25833863 -0.00485613 -0.21211637 0.19286126 -1.09503691], intercept: [85.76414815]
    trainsize: 270010, coef: [ 0.73787648 0.30155102 0.44013832 -0.2355825 0.26255699 1.55410066 0.4733571 0.85352683 1.4399516 -1.73360843], intercept: [84.19473044]
    trainsize: 300010, coef: [ 0.04861321 -0.35446415 -0.17774692 -0.1060901 -0.5864299 1.03429399 0.57160049 -0.13900199 1.09189946 -1.26298814], intercept: [83.14797646]
    trainsize: 330010, coef: [ 0.20214825 0.22605839 0.17022397 0.28191112 -1.05982574 0.74025932 0.04981973 -0.27232538 0.72094765 -0.94875017], intercept: [81.97656309]

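Edit 2: @Nikaido asked for the distribution of the data. The feature distributions of the training and test sets are very similar, because the original values are either categorical (in the range 1-9) or decomposed timestamps (month number, day of week, hour, minute). The labels plot shows a departure from a normal distribution around 100. The reason is that missing values were replaced by the global mean of each category (between 80 and 95).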

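In addition, I created a plot showing a zoomed-in view of the validation, generated by changing the range in the code snippet above to:

    my_rng = range(1000, len(X_train) - 200000, 2000)

You can see the typical SGD jumping around the optimum. Either way, the trend of the test score does not change in any significant way as more training records are added.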

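Answer:

Edit: regarding your output, my guess is that your results on the validation set are so close to each other because a linear model like SGDRegressor tends to underfit on complex data.

You can check the weights the model outputs at each iteration. You will see that they are the same or very close.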
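One way to carry out the coefficient check suggested above is sketched here. The data is synthetic (a linear signal plus noise; the sizes, `true_w` and the noise level are made up) and only stands in for the asker's `X_train` and `y_train`, so the printed numbers are illustrative, not a reproduction of the question's results.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 10 standardized features, noisy linear target
rng = np.random.default_rng(0)
n, d = 20000, 10
X = StandardScaler().fit_transform(rng.normal(size=(n, d)))
true_w = rng.normal(size=d)
y = 85.0 + X @ true_w + rng.normal(scale=5.0, size=n)

sizes = [100, 2000, 8000, 20000]
coefs = []
for m in sizes:
    # Fresh model per subset size, mirroring the loop in the question
    model = SGDRegressor(alpha=1e-5, penalty="l1", random_state=0)
    model.fit(X[:m], y[:m])
    coefs.append(model.coef_.copy())
    print(m, np.round(model.coef_, 2), np.round(model.intercept_, 2))

# Euclidean distance of each coefficient vector from the one fitted on the largest subset
drift = [float(np.linalg.norm(c - coefs[-1])) for c in coefs]
print(dict(zip(sizes, np.round(drift, 3))))
```

If the drift values flatten out after the first few subset sizes, the model has effectively stopped changing, which would be consistent with a flat test score.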

To get more variability in the output, you need to introduce non-linearity and complexity.
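As a rough illustration of what adding non-linearity can look like, the sketch below compares a plain linear SGDRegressor with the same regressor fed degree-2 polynomial features. The data is synthetic with a deliberately non-linear target (an assumption made for this example), and polynomial expansion is just one possible way to add complexity, not something the answer prescribes.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data (assumption): the target depends on x0**2 and on x0*x1
rng = np.random.default_rng(0)
X = rng.uniform(-2.0, 2.0, size=(6000, 5))
y = 85.0 + 3.0 * X[:, 0] ** 2 + 2.0 * X[:, 0] * X[:, 1] + rng.normal(size=6000)
X_train, X_test = X[:4000], X[4000:]
y_train, y_test = y[:4000], y[4000:]

# Baseline: linear model on the scaled raw features
linear = make_pipeline(StandardScaler(), SGDRegressor(random_state=0))

# Same regressor, but on degree-2 polynomial features (squares and pairwise products)
poly = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    SGDRegressor(random_state=0),
)

linear.fit(X_train, y_train)
poly.fit(X_train, y_train)

mse_linear = mean_squared_error(y_test, linear.predict(X_test))
mse_poly = mean_squared_error(y_test, poly.predict(X_test))
print(f"linear test MSE: {mse_linear:.2f}")
print(f"poly   test MSE: {mse_poly:.2f}")
```

On data like this the linear model cannot represent the squared and interaction terms, so its test error stays high however many rows it sees. Whether the asker's data behaves the same way is not known from the post.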

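What you are seeing is what machine learning calls "bias" (as opposed to "variance").

I think I understand now.

SamAmani, in the end I think the problem is underfitting, combined with the fact that you are using incrementally larger subsets of the dataset. The model underfits very quickly (that is, it settles on a more or less fixed model right at the start).

Only the first training run gives a noticeably different result on the test set, because at that point the model has not yet more or less reached its final form.

The underlying variability lies in the incremental training sets. Put simply, the test results are the more accurate estimate of the underfitted model's performance. Adding training samples will eventually bring the test and training results closer together, but without improving them much.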
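The train/test behavior described above can be inspected with scikit-learn's `learning_curve`. This is a minimal sketch on synthetic data whose target has a non-linear part a linear model cannot fit (an assumption chosen to provoke underfitting); the subset sizes and `cv=3` are arbitrary choices.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import learning_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): the x0**2 term is out of reach for a linear model
rng = np.random.default_rng(0)
X = rng.uniform(-2.0, 2.0, size=(6000, 5))
y = 85.0 + 3.0 * X[:, 0] ** 2 + X[:, 1] + rng.normal(size=6000)

model = make_pipeline(StandardScaler(), SGDRegressor(random_state=0))

# Refit on growing subsets; each is scored on its own rows and on held-out folds
sizes, train_scores, val_scores = learning_curve(
    model, X, y,
    train_sizes=[20, 200, 1000, 4000],
    cv=3,
    scoring="neg_mean_squared_error",
)

# Scores are negated MSE, so flip the sign and average over the folds
train_mse = -train_scores.mean(axis=1)
val_mse = -val_scores.mean(axis=1)
for m, tr, va in zip(sizes, train_mse, val_mse):
    print(f"n={m:5d}  train MSE={tr:8.2f}  validation MSE={va:8.2f}  gap={va - tr:8.2f}")
```

For an underfitting model, the expectation is that both curves level off at a similar, fairly high error: more rows narrow the gap without lowering the plateau.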

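You can check that the incremental subsets of the training data differ from the test set. What you did wrong was to check the statistics on the whole training set.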
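One possible reading of this advice is to compare summary statistics of each incremental slice, rather than of the full training set, against the test set. The helper `subset_stats` below is hypothetical, and the label arrays are synthetic placeholders for the asker's `y_train` and `y_test`.

```python
import numpy as np

# Synthetic placeholder labels (assumption), roughly in the range mentioned in the question
rng = np.random.default_rng(0)
y_train = rng.normal(85.0, 5.0, size=30000)
y_test = rng.normal(85.0, 5.0, size=10000)


def subset_stats(y_train, y_test, sizes):
    """Return (size, mean, std, mean difference to test) for each slice y_train[:m]."""
    rows = []
    for m in sizes:
        part = y_train[:m]
        rows.append((m, float(part.mean()), float(part.std()),
                     float(part.mean() - y_test.mean())))
    return rows


rows = subset_stats(y_train, y_test, sizes=[10, 1000, 30000])
for m, mean, std, diff in rows:
    print(f"n={m:6d}  mean={mean:7.2f}  std={std:6.2f}  mean - test mean={diff:+.2f}")
```

The same loop could be applied column by column to the features if the slices are suspected to differ there.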
