Why is my MSE so high when the differences between the test values and the predicted values are small?

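In Python, I built a small multiple linear regression model to explain district housing prices from other variables (all percentage variables are multiplied by 100), such as the percentage of people in the district with a bachelor's degree and the percentage who work from home. I ran this model in R and it worked fine, but I am new to machine learning in Python. Below I show the output of y_pred = regressor.predict(X_test) and the MSE I get. I have also included a sample of the data, where avgincome, PctSingleDetached and PctDrivetoWork are the X variables and AvgHousingPrice is the Y variable.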

    import numpy as np
    import matplotlib.pyplot as plt
    import pandas as pd
    from sklearn.impute import SimpleImputer

Sample data:

       avgincome  PctSingleDetached  PctDrivetoWork  AvgHousingPrice
    0    44388.0          61.528497       81.151832           448954
    1    40650.0          54.372197       77.882798           349758
    2    43350.0          68.393782       79.553265           428740

    # hamiltondata is the DataFrame holding the data above (loading step not shown)
    X = hamiltondata.iloc[:, :-1].values
    Y = hamiltondata.iloc[:, -1].values

    # SimpleImputer replaces missing values (NaN) with the column mean
    imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
    # fit() computes the mean of each feature column; all columns here are numeric
    imputer.fit(X[:, :])
    X[:, :] = imputer.transform(X[:, :])

    # from sklearn.compose import ColumnTransformer
    # from sklearn.preprocessing import OneHotEncoder
    # ct = ColumnTransformer(transformers = [('encoder', OneHotEncoder(), [0])], remainder = 'passthrough')
    # X = np.array(ct.fit_transform(X))

    print(X)
    print(Y)

    ## Splitting into training and testing ##
    from sklearn.model_selection import train_test_split
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)

    ### Feature Scaling ###
    from sklearn.preprocessing import StandardScaler
    sc = StandardScaler()  # standardizes each feature: (x - mean) / std
    # fit_transform() learns the mean and std from the training set and applies them
    X_train[:, 0:] = sc.fit_transform(X_train[:, 0:])
    # transform() reuses the training-set mean and std on the test set
    X_test[:, 0:] = sc.transform(X_test[:, 0:])
    print(X_train)
    print(X_test)

    ## Training ##
    from sklearn.linear_model import LinearRegression
    regressor = LinearRegression()  # ordinary least squares on all features
    regressor.fit(X_train, Y_train)

    ### Predicting Test Set results ###
    y_pred = regressor.predict(X_test)
    np.set_printoptions(precision=2)  # display only 2 digits after the decimal point
    # print predictions (left column) next to the actual values (right column)
    print(np.concatenate((y_pred.reshape(len(y_pred), 1), Y_test.reshape(len(Y_test), 1)), axis=1))

    from sklearn.metrics import mean_squared_error
    mse = mean_squared_error(Y_test, y_pred)
    print(mse)

OUTPUT:

    [[489066.76 300334.  ]
     [227458.2  200352.  ]
     [928249.59 946729.  ]
     [339032.27 350116.  ]
     [689668.21 600322.  ]
     [489179.58 577936.  ]]
    ......
    MSE = 2375985640.8102403

Answer:

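You can compute the MSE yourself to check whether something is wrong. To me, the result you got looks reasonable. In any case, I wrote a simple my_mse function to check the value sklearn returns, using the sample rows from your output: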

    from sklearn.metrics import mean_squared_error

    # Rows copied from the question's output: [predicted, actual]
    list_ = [[489066.76, 300334.],
             [227458.2, 200352.],
             [928249.59, 946729.],
             [339032.27, 350116.],
             [689668.21, 600322.],
             [489179.58, 577936.]]
    y_pred = [row[0] for row in list_]
    y_true = [row[1] for row in list_]

    mse = mean_squared_error(y_true, y_pred)
    print(mse)
    # 8779930962.14985

    def my_mse(y_true, y_pred):
        # Sum the squared differences, then divide by the number of samples
        diff = 0
        for couple in zip(y_true, y_pred):
            diff += pow(couple[0] - couple[1], 2)
        return diff / len(y_true)

    print(my_mse(y_true, y_pred))
    # 8779930962.14985

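Keep in mind that MSE is the mean squared error: each error is squared before it is summed. Your prices are in the hundreds of thousands, so errors in the tens of thousands are small relative to the prices but become billions once squared. An error of 90,000, for example, contributes 8.1 billion to the sum.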
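To make the effect of squaring concrete, here is a minimal sketch in plain Python that reports the same six rows as MSE, RMSE (the square root of the MSE) and MAE (the mean absolute error). The helper names `mse`, `rmse` and `mae` are made up for this illustration and are not part of the question's code. RMSE and MAE are in the same units as the housing price, which makes them easier to compare with the values themselves.

```python
import math

# (predicted, actual) pairs copied from the question's printed output
pairs = [
    (489066.76, 300334.0),
    (227458.20, 200352.0),
    (928249.59, 946729.0),
    (339032.27, 350116.0),
    (689668.21, 600322.0),
    (489179.58, 577936.0),
]

def mse(pairs):
    """Mean of the squared differences (units: price squared)."""
    return sum((p - a) ** 2 for p, a in pairs) / len(pairs)

def rmse(pairs):
    """Square root of the MSE (units: price)."""
    return math.sqrt(mse(pairs))

def mae(pairs):
    """Mean of the absolute differences (units: price)."""
    return sum(abs(p - a) for p, a in pairs) / len(pairs)

print(f"MSE:  {mse(pairs):,.0f}")
print(f"RMSE: {rmse(pairs):,.0f}")
print(f"MAE:  {mae(pairs):,.0f}")
```

On these six rows the MSE is in the billions, while the RMSE and MAE come out in the tens of thousands, which is the scale of the differences visible in the printed output.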

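If you are asking whether your model is good or bad, that depends on your main objective. That said, I suspect your model performs poorly because it is a linear model. A more complex model might handle this problem better and produce better results.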
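One way to judge the fit independently of the price scale is to look at scale-free measures. The sketch below computes R² and the mean absolute percentage error by hand for the six rows shown in the question. The function names `r2` and `mape` are hypothetical helpers for this example. Six rows are far too few to draw conclusions about the full model, so treat the numbers as an illustration of the calculation only.

```python
# Values copied from the question's printed output
y_pred = [489066.76, 227458.20, 928249.59, 339032.27, 689668.21, 489179.58]
y_true = [300334.0, 200352.0, 946729.0, 350116.0, 600322.0, 577936.0]

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

def mape(y_true, y_pred):
    """Mean absolute percentage error, as a fraction of the actual value."""
    return sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)

print(f"R^2:  {r2(y_true, y_pred):.3f}")
print(f"MAPE: {mape(y_true, y_pred):.1%}")
```

Both values stay the same if every price is multiplied by a constant, unlike the MSE, which grows with the square of that constant.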
