Why doesn't my custom linear regression model match sklearn?

I am trying to build a simple linear model in Python without using any libraries (other than numpy). Here is what I have so far:

import numpy as np
import pandas

np.random.seed(1)
alpha = 0.1

def h(x, w):
  return np.dot(w.T, x)

def cost(X, W, Y):
  totalCost = 0
  for i in range(47):
    diff = h(X[i], W) - Y[i]
    squared = diff * diff
    totalCost += squared
  return totalCost / 2

housing_data = np.loadtxt('Housing.csv', delimiter=',')

x1 = housing_data[:,0]
x2 = housing_data[:,1]
y = housing_data[:,2]

avgX1 = np.mean(x1)
stdX1 = np.std(x1)
normX1 = (x1 - avgX1) / stdX1
print('avgX1', avgX1)
print('stdX1', stdX1)

avgX2 = np.mean(x2)
stdX2 = np.std(x2)
normX2 = (x2 - avgX2) / stdX2
print('avgX2', avgX2)
print('stdX2', stdX2)

normalizedX = np.ones((47, 3))
normalizedX[:,1] = normX1
normalizedX[:,2] = normX2

np.savetxt('normalizedX.csv', normalizedX)

weights = np.ones((3,))

for boom in range(100):
  currentCost = cost(normalizedX, weights, y)
  if boom % 1 == 0:
    print(boom, 'iteration', weights[0], weights[1], weights[2])
    print('Cost', currentCost)

  for i in range(47):
    errorDiff = h(normalizedX[i], weights) - y[i]
    weights[0] = weights[0] - alpha * (errorDiff) * normalizedX[i][0]
    weights[1] = weights[1] - alpha * (errorDiff) * normalizedX[i][1]
    weights[2] = weights[2] - alpha * (errorDiff) * normalizedX[i][2]

print(weights)

predictedX = [1, (2100 - avgX1) / stdX1, (3 - avgX2) / stdX2]
firstPrediction = np.array(predictedX)
print('firstPrediction', firstPrediction)
firstPrediction = h(firstPrediction, weights)
print(firstPrediction)

First, it converges very quickly, after only 14 iterations. Second, it gives a different result than linear regression with sklearn. For reference, my sklearn code is:

import numpy
import matplotlib.pyplot as plot
import pandas
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

dataset = pandas.read_csv('Housing.csv', header=None)
x = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 2].values

linearRegressor = LinearRegression()

xnorm = sklearn.preprocessing.scale(x)
scaleCoef = sklearn.preprocessing.StandardScaler().fit(x)
mean = scaleCoef.mean_
std = numpy.sqrt(scaleCoef.var_)
print('std')
print(std)

stuff = linearRegressor.fit(xnorm, y)

predictedX = [[(2100 - mean[0]) / std[0], (3 - mean[1]) / std[1]]]
yPrediction = linearRegressor.predict(predictedX)
print('predictedX', predictedX)
print('predict', yPrediction)
print(stuff.coef_, stuff.intercept_)

My custom model predicts a y value of 337,000, while sklearn predicts 355,000. My data is 47 rows that look like this:

2104,3,3.999e+05
1600,3,3.299e+05
2400,3,3.69e+05
1416,2,2.32e+05
3000,4,5.399e+05
1985,4,2.999e+05
1534,3,3.149e+05

The full data is available at https://github.com/shamoons/linear-logistic-regression/blob/master/Housing.csv

I assume the reason is either (a) there is a bug in my gradient-descent regression, or (b) I'm not using sklearn correctly.

Is there any other reason why these two models would predict different outputs for a given input?


Answer:

I think you are missing the 1/m term (where m is the size of y) in your gradient descent. After including the 1/m term, I get a predicted value similar to your sklearn code.
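To see why the 1/m factor matters, here is a minimal sketch of the same update written in batch (vectorized) form rather than your per-sample loop, on small synthetic data (not your Housing.csv): with the cost J(w) = (1/2m) Σ (h(x_i) − y_i)², the gradient carries a 1/m, which keeps the effective step size independent of the number of rows.

```python
import numpy as np

# Sketch: batch gradient descent with the 1/m factor, on synthetic data.
np.random.seed(0)
m = 47
X = np.column_stack([np.ones(m), np.random.randn(m, 2)])  # bias + 2 features
true_w = np.array([340000.0, 100000.0, -5000.0])          # hypothetical weights
y = X @ true_w                                            # noiseless targets

w = np.zeros(3)
alpha = 0.1
for _ in range(2000):
    grad = X.T @ (X @ w - y) / m   # the 1/m term scales the gradient
    w -= alpha * grad

print(w)  # approaches true_w
```

Without the 1/m, the same alpha = 0.1 acts like a step size roughly m times larger, which is why your version behaves so differently.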

See the code below:

....

weights = np.ones((3,))
m = y.size

for boom in range(100):
  currentCost = cost(normalizedX, weights, y)
  if boom % 1 == 0:
    print(boom, 'iteration', weights[0], weights[1], weights[2])
    print('Cost', currentCost)

  for i in range(47):
    errorDiff = h(normalizedX[i], weights) - y[i]
    weights[0] = weights[0] - alpha * (1/m) * (errorDiff) * normalizedX[i][0]
    weights[1] = weights[1] - alpha * (1/m) * (errorDiff) * normalizedX[i][1]
    weights[2] = weights[2] - alpha * (1/m) * (errorDiff) * normalizedX[i][2]

...

This gives 355242 for the first prediction.

This agrees well with the linear regression model, even though it does not use gradient descent.
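That agreement is expected: gradient descent with the 1/m term and the closed-form least-squares solution minimize the same cost, so they converge to the same weights. A quick sketch of this on synthetic data (using `np.linalg.lstsq` for the closed form; sklearn's `LinearRegression` solves the same least-squares problem, though its internal solver may differ):

```python
import numpy as np

# Sketch: gradient descent with 1/m vs. the closed-form least-squares fit.
np.random.seed(1)
m = 47
X = np.column_stack([np.ones(m), np.random.randn(m, 2)])
y = X @ np.array([3.0, 1.5, -2.0]) + 0.01 * np.random.randn(m)

# Closed-form solution of min_w ||Xw - y||^2
w_closed, *_ = np.linalg.lstsq(X, y, rcond=None)

# Gradient descent with the 1/m factor
w_gd = np.zeros(3)
for _ in range(5000):
    w_gd -= 0.1 * X.T @ (X @ w_gd - y) / m

print(w_closed)
print(w_gd)  # both land on the same weights
```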

I also tried SGDRegressor (which uses stochastic gradient descent) in sklearn, and it too appears to get a value close to the linear regression model and your model. See the code below:

import numpy
import matplotlib.pyplot as plot
import pandas
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, SGDRegressor

dataset = pandas.read_csv('Housing.csv', header=None)
x = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 2].values

sgdRegressor = SGDRegressor(penalty='none', learning_rate='constant', eta0=0.1, max_iter=1000, tol=1E-6)

xnorm = sklearn.preprocessing.scale(x)
scaleCoef = sklearn.preprocessing.StandardScaler().fit(x)
mean = scaleCoef.mean_
std = numpy.sqrt(scaleCoef.var_)
print('std')
print(std)

yPrediction = []
predictedX = [[(2100 - mean[0]) / std[0], (3 - mean[1]) / std[1]]]
print('predictedX', predictedX)

for trials in range(10):
    stuff = sgdRegressor.fit(xnorm, y)
    yPrediction.extend(sgdRegressor.predict(predictedX))

print('predict', numpy.mean(yPrediction))

which results in

predict 355533.10119985335
