Am I measuring the performance of my multiple linear regression model correctly?

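This question may be a bit silly (and possibly trivial), but I'm new to machine learning. That is probably easy to infer from the code I wrote, and I don't mean it as an excuse for a poorly asked question. If you think the question is poorly asked, please let me know so I can update it.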

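I trained a multiple linear regression model and wanted to see how well it performs on a given dataset. So I searched online and found a good article that explains how to compute the "error" between the predicted values and the true values. The article offers several options: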

I applied all of these methods and they gave very large values, so I don't know whether the results are correct or how I should interpret them.

Output reported in the article:

10.0
150.0
12.2474487139

Output from my model:

7514.293659640891
83502864.03257468
9137.990152794797

For quick reference, these are my true/predicted values

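The question in short: am I measuring the error correctly with the methods above, and do these results mean that my model performs very poorly? (It doesn't look that way when I compare the predicted values with the true values.)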

You can view the dataset I used here.

The code I used to create the model and the predictions (I tried to remove unnecessary code):

    # Importing the libraries
    import numpy as np
    import matplotlib.pyplot as plt
    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split
    from sklearn import metrics

    dataset = pd.read_csv('50_Startups.csv')
    X = dataset.iloc[:, :-1].values  # Independent variables
    y = dataset.iloc[:, 4].values    # Dependent variable

    # Encode the categorical state column (index 3) as dummy variables.
    # drop='first' removes the first dummy column (avoids the dummy variable trap),
    # remainder='passthrough' keeps the other columns after the dummy columns,
    # and sparse_threshold=0 makes the transformer return a dense array.
    encoder = ColumnTransformer(
        [('state', OneHotEncoder(drop='first'), [3])],
        remainder='passthrough',
        sparse_threshold=0,
    )
    X = encoder.fit_transform(X).astype(float)

    # Splitting the dataset into the training set and the test set.
    # 40 of the 50 records are used for training and 10 for testing,
    # hence the 0.2 split ratio.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    # Create a regressor
    regressor = LinearRegression()

    # Fit the model to the training data
    regressor.fit(X_train, y_train)

    # Make predictions on the test set, using our model
    y_pred = regressor.predict(X_test)

    # Evaluating the model (am I doing this correctly?)
    # How well did it do?
    print(metrics.mean_absolute_error(y_test, y_pred))          # MAE
    print(metrics.mean_squared_error(y_test, y_pred))           # MSE
    print(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))  # RMSE

Answer:

To answer the question: I think you are measuring it correctly (at least as far as the code goes). But:

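Who told you the relationship is linear? You are trying to predict profit (right)? I think linear regression may not work very well here, so I'm not surprised that you aren't getting good results.
To see how good your predictions are, try plotting the predicted values against the true values and check how closely the points fall along the line where predicted equals true.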
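A minimal sketch of such a predicted-versus-actual plot, assuming matplotlib is available. The helper name `plot_predicted_vs_actual` and the sample numbers are hypothetical and only there for illustration; in the question's code you would pass `y_test` and `y_pred` instead.

```python
import matplotlib.pyplot as plt


def plot_predicted_vs_actual(y_true, y_pred):
    """Scatter predicted against actual values and draw the y = x reference line."""
    fig, ax = plt.subplots()
    ax.scatter(y_true, y_pred, label="test samples")

    # The dashed diagonal marks perfect predictions (predicted == actual).
    low = min(min(y_true), min(y_pred))
    high = max(max(y_true), max(y_pred))
    ax.plot([low, high], [low, high], linestyle="--", color="gray",
            label="perfect prediction (y = x)")

    ax.set_xlabel("Actual value")
    ax.set_ylabel("Predicted value")
    ax.legend()
    return ax


# Hypothetical numbers, not taken from the dataset in the question.
y_true = [100, 50, 30, 20]
y_pred = [90, 50, 50, 30]
ax = plot_predicted_vs_actual(y_true, y_pred)
# In a script or notebook, call plt.show() to display the figure.
```

Points close to the dashed line are well predicted. A systematic curve away from it would hint that a straight-line model is missing something.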

Summary: the large values you got do not mean that your code is wrong. Most likely the relationship is not linear.
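One thing that may help when reading such numbers: MAE and RMSE are expressed in the units of the target, and MSE in squared units, so their size depends on the scale of the values being predicted. The sketch below uses hypothetical values (not taken from the article or from the dataset), picked so that the three metrics come out to 10.0, 150.0 and roughly 12.247. Multiplying the same values by 1000 multiplies MAE and RMSE by 1000 and MSE by a million, although the predictions are proportionally just as good.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(mse(y_true, y_pred))


# Hypothetical small-scale values.
y_true = [100, 50, 30, 20]
y_pred = [90, 50, 50, 30]

# The same values on a scale 1000 times larger.
y_true_big = [v * 1000 for v in y_true]
y_pred_big = [v * 1000 for v in y_pred]

print(mae(y_true, y_pred), mse(y_true, y_pred), rmse(y_true, y_pred))
print(mae(y_true_big, y_pred_big), mse(y_true_big, y_pred_big),
      rmse(y_true_big, y_pred_big))
```

Comparing MAE or RMSE with the typical magnitude of the target (for example its mean) gives a rough sense of whether an error is large in relative terms.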

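One more note: the categorical variable may be the source of the problem. Have you tried running the linear regression without the state? What results do you get? Which variables matter most in your regression? You should check that. What is your R squared?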
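For the R squared question, here is a small sketch of how the value is defined, written in plain Python. The function name `r_squared` and the sample numbers are hypothetical. R squared is one minus the ratio of the residual sum of squares to the total sum of squares around the mean of the true values, so unlike MAE or RMSE it does not depend on the scale of the target.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R squared is undefined when all true values are equal")
    return 1 - ss_res / ss_tot


# Hypothetical numbers, not taken from the dataset in the question.
y_true = [100, 50, 30, 20]
y_pred = [90, 50, 50, 30]
print(r_squared(y_true, y_pred))
```

A value of 1 means perfect predictions, and 0 means the model does no better than always predicting the mean. With the fitted model from the question, `metrics.r2_score(y_test, y_pred)` or `regressor.score(X_test, y_test)` should report the same quantity, and `regressor.coef_` holds the fitted coefficient of each input column.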

Hope this helps, Umberto
