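I am trying to predict Boston house prices. With polynomial regression of degree 1 or 2, the R2 score is acceptable. With degree 3, however, the R2 score gets much worse instead.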
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
# Note: load_boston was removed in scikit-learn 1.2, so this needs an older version.
from sklearn.datasets import load_boston
boston_dataset = load_boston()
dataset = pd.DataFrame(boston_dataset.data, columns=boston_dataset.feature_names)
dataset['MEDV'] = boston_dataset.target
X = dataset.iloc[:, 0:13].values
y = dataset.iloc[:, 13].values.reshape(-1, 1)

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Fitting Polynomial Regression to the dataset
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
poly_reg = PolynomialFeatures(degree=2)  # <-- Tuning to 3
X_poly = poly_reg.fit_transform(X_train)
lin_reg_2 = LinearRegression()
lin_reg_2.fit(X_poly, y_train)

# Evaluating on the test set (transform only; the expansion was fitted on the training set)
y_pred = lin_reg_2.predict(poly_reg.transform(X_test))
from sklearn.metrics import r2_score
print('Prediction Score is: ', r2_score(y_test, y_pred))
Output (degree = 2):
Prediction Score is: 0.6903318065831567
Output (degree = 3):
Prediction Score is: -12898.308114085281
Answer:
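This is called overfitting. At degree 3 the model is flexible enough to fit the training set almost perfectly, which gives it high variance: a hypothesis that matches the training data that closely usually performs poorly on the test set. You can check this by computing the R2 score on the training set, for example r2_score(y_train, lin_reg_2.predict(X_poly)); it will be very high. You need to find the balance point between bias and variance.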
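To make the train/test gap concrete, here is a minimal sketch that compares both scores for degrees 1 to 3. It uses synthetic data (an assumption, since load_boston is no longer shipped with recent scikit-learn), with 13 features to mirror the question. With 13 inputs, a degree-3 expansion produces 560 terms, which is more than the 404 training rows in the question's split, so an unregularized linear model can reproduce the training targets almost exactly. The names `scores` and the data-generating formula are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in for the housing data (assumption): 13 features, noisy target.
rng = np.random.default_rng(0)
X = rng.normal(size=(250, 13))
y = 3.0 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.5, size=250)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

scores = {}
for degree in (1, 2, 3):
    poly = PolynomialFeatures(degree=degree)
    X_train_poly = poly.fit_transform(X_train)  # fit the expansion on training data
    X_test_poly = poly.transform(X_test)        # reuse it unchanged on test data

    model = LinearRegression().fit(X_train_poly, y_train)

    # r2_score takes (y_true, y_pred), so predict first, then score.
    train_r2 = r2_score(y_train, model.predict(X_train_poly))
    test_r2 = r2_score(y_test, model.predict(X_test_poly))
    scores[degree] = (X_train_poly.shape[1], train_r2, test_r2)

    print(f"degree={degree} terms={X_train_poly.shape[1]} "
          f"train R2={train_r2:.3f} test R2={test_r2:.3f}")
```

A training score near 1 combined with a much lower test score is the signature of overfitting; the exact numbers depend on the data.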
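If you want a higher R2 score, you can try other regression models such as Lasso and Ridge and tune their alpha values. For a better understanding, I have included a picture that shows how the hypothesis curve is affected as the polynomial degree increases.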
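As a rough sketch of the Ridge suggestion, the snippet below keeps the degree-3 expansion but lets RidgeCV pick alpha from a small grid by cross-validation. The synthetic data, the alpha grid, and the added StandardScaler step are all assumptions made for illustration, not part of the original answer; LassoCV could be swapped in the same way.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for the housing data (assumption): 13 features, noisy target.
rng = np.random.default_rng(0)
X = rng.normal(size=(250, 13))
y = 3.0 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.5, size=250)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Candidate regularization strengths (illustrative grid, not a recommendation).
alphas = np.logspace(-2, 3, 6)

# Expand to degree 3, put all terms on a common scale, then fit Ridge
# with alpha chosen by RidgeCV's built-in cross-validation.
model = make_pipeline(
    PolynomialFeatures(degree=3, include_bias=False),
    StandardScaler(),
    RidgeCV(alphas=alphas),
)
model.fit(X_train, y_train)

chosen_alpha = model[-1].alpha_
ridge_train_r2 = r2_score(y_train, model.predict(X_train))
ridge_test_r2 = r2_score(y_test, model.predict(X_test))

print(f"chosen alpha={chosen_alpha}")
print(f"train R2={ridge_train_r2:.3f} test R2={ridge_test_r2:.3f}")
```

Larger alpha values shrink the coefficients more strongly, trading a little training accuracy for lower variance; scaling the features first keeps the penalty from treating high-magnitude cubic terms differently from the rest.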