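I fit a small multiple linear regression model in Python to explain district housing prices from other variables (all of them percentages multiplied by 100), such as the percentage of people in a district with a bachelor's degree and the percentage who work from home. I fit the same model in R and it worked fine, but I'm new to machine learning in Python. Below I show the output of y_pred = regressor.predict(X_test) and the MSE I get. I've also included a sample of the data, where avgincome, PctSingleDetached and PctDrivetoWork are the X variables and AvgHousingPrice is the Y variable.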
    import matplotlib.pyplot as plt
    import numpy as np
    import pandas as pd
    from sklearn.impute import SimpleImputer

    # sample data:
    #    avgincome  PctSingleDetached  PctDrivetoWork  AvgHousingPrice
    # 0    44388.0          61.528497       81.151832           448954
    # 1    40650.0          54.372197       77.882798           349758
    # 2    43350.0          68.393782       79.553265           428740

    # hamiltondata is the DataFrame holding the data above (loading step not shown)
    X = hamiltondata.iloc[:, :-1].values
    Y = hamiltondata.iloc[:, -1].values

    # This is an object of the imputer class. It will help us find that average to infer.
    # Instructs to find missing and replace it with mean
    imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

    # Fit method in SimpleImputer will connect imputer to our matrix of features
    imputer.fit(X[:, :])  # We exclude column "O" AKA Country because they are strings
    X[:, :] = imputer.transform(X[:, :])

    # from sklearn.compose import ColumnTransformer
    # from sklearn.preprocessing import OneHotEncoder
    # ct = ColumnTransformer(transformers = [('encoder', OneHotEncoder(), [0])], remainder = 'passthrough')
    # X = np.array(ct.fit_transform(X))

    print(X)
    print(Y)

    ## Splitting into training and testing ##
    from sklearn.model_selection import train_test_split
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)

    ### Feature Scaling ###
    from sklearn.preprocessing import StandardScaler
    sc = StandardScaler()  # this does STANDARDIZATION for you. See data standardization formula
    # Fit changes the data, Transform applies it! Here we have a method that does both
    X_train[:, 0:] = sc.fit_transform(X_train[:, 0:])
    X_test[:, 0:] = sc.transform(X_test[:, 0:])

    print(X_train)
    print(X_test)

    ## Training ##
    from sklearn.linear_model import LinearRegression
    regressor = LinearRegression()  # This class takes care of selecting the best variables. Very convenient
    regressor.fit(X_train, Y_train)

    ### Predicting Test Set results ###
    y_pred = regressor.predict(X_test)
    np.set_printoptions(precision=2)  # Display any numerical value with only 2 numbers after decimal
    # this just simply makes everything vertical
    print(np.concatenate((y_pred.reshape(len(y_pred), 1), Y_test.reshape(len(Y_test), 1)), axis=1))

    from sklearn.metrics import mean_squared_error
    mse = mean_squared_error(Y_test, y_pred)
    print(mse)

OUTPUT:

    [[489066.76 300334.  ]
     [227458.2  200352.  ]
     [928249.59 946729.  ]
     [339032.27 350116.  ]
     [689668.21 600322.  ]
     [489179.58 577936.  ]]
    ......
    MSE = 2375985640.8102403
Answer:
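You can compute the MSE yourself to check whether something is wrong. The result you got looks reasonable to me. In any case, I wrote a simple my_mse function to check the value sklearn returns, using the six rows from your output: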
    from sklearn.metrics import mean_squared_error

    # Rows copied from the question's output: [predicted, actual]
    list_ = [[489066.76, 300334.], [227458.2, 200352.], [928249.59, 946729.],
             [339032.27, 350116.], [689668.21, 600322.], [489179.58, 577936.]]
    y_pred = [row[0] for row in list_]
    y_true = [row[1] for row in list_]

    mse = mean_squared_error(y_true, y_pred)
    print(mse)
    # 8779930962.14985

    def my_mse(y_true, y_pred):
        # Sum the squared differences, then divide by the number of samples
        diff = 0
        for couple in zip(y_true, y_pred):
            diff += pow(couple[0] - couple[1], 2)
        return diff / len(y_true)

    print(my_mse(y_true, y_pred))
    # 8779930962.14985
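Keep in mind that MSE is the mean squared error: each error is squared before being averaged. With prices in the hundreds of thousands, squared errors in the billions are to be expected.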
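Because the MSE is in squared price units, it can be easier to read after taking the square root (RMSE), which is back in the same units as AvgHousingPrice. The sketch below uses the MSE reported in the question. The comparison against the mean price is only a rough illustration, since it assumes the six actual prices visible in the printed output are representative of the full test set.

```python
import math

# MSE reported in the question for the full test set
reported_mse = 2375985640.8102403

# RMSE is in the same units as AvgHousingPrice
rmse = math.sqrt(reported_mse)
print(f"RMSE: {rmse:,.2f}")

# Actual prices from the six rows shown in the question's output
# (second column). Assumption: these are only a sample of the test set.
shown_actuals = [300334.0, 200352.0, 946729.0, 350116.0, 600322.0, 577936.0]
mean_actual = sum(shown_actuals) / len(shown_actuals)

# Typical error expressed as a fraction of the mean shown price
relative_error = rmse / mean_actual
print(f"RMSE relative to mean shown price: {relative_error:.1%}")
```

Read this way, the typical error is on the order of tens of thousands, which is much easier to judge against house prices than a raw MSE in the billions.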
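If you are asking whether your model is good or bad, that depends on your main goal. Either way, I think your model performs poorly because it is a linear model. A more complex model might be able to handle this problem and produce better results.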
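One way to check whether a more complex model actually helps is to compare candidates under the same cross-validation, alongside a trivial baseline. This is only a sketch: the data below is synthetic (it is not the question's dataset), and the choice of RandomForestRegressor as the "more complex model" is my assumption, not something the question specifies.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-in data (NOT the question's data): 3 features and a
# target with a nonlinear term, so a linear fit cannot capture everything
X = rng.uniform(0, 100, size=(200, 3))
y = 1000 * X[:, 0] + 50 * X[:, 1] ** 2 + rng.normal(0, 5000, size=200)

models = {
    "baseline (mean)": DummyRegressor(strategy="mean"),
    "linear": LinearRegression(),
    "random forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

rmse_by_model = {}
for name, model in models.items():
    # The scorer returns negative RMSE, so flip the sign
    scores = cross_val_score(
        model, X, y, cv=5, scoring="neg_root_mean_squared_error"
    )
    rmse_by_model[name] = -scores.mean()
    print(f"{name}: RMSE = {rmse_by_model[name]:,.0f}")
```

On your own data, replace X and y with your feature matrix and AvgHousingPrice. If the linear model is not clearly better than the mean baseline, or a more flexible model is not clearly better than the linear one, that tells you more than the raw MSE does.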