不同的Python最小化函数给出不同的值，为什么？

我正在通过重写Andrew Ng的机器学习课程作业来学习Python，这些作业原本是用Octave编写的（我已经上过这门课并获得了证书）。我在优化函数上遇到了问题。在课程中，他们使用了fmincg，这是一个在Octave中用于最小化线性回归的成本函数（凸函数）的函数，并提供了其导数。他们还教你如何使用梯度下降和正规方程，理论上，如果使用正确，这些方法都应该给出相同的结果（在几位小数内）。它们在线性回归中表现得很好，我在Python中也得到了相同的结果。明确地说，我正在尝试最小化成本函数以找到数据集的最佳拟合参数（theta）。到目前为止，我使用了‘nelder-mead’方法，它不需要导数，并且给出了与他们最接近的解决方案。我还尝试了‘TNC’、‘CG’和‘BFGS’，这些都需要导数来最小化函数。当我使用一阶多项式（线性）时，它们都表现得很好，但当我将多项式的阶数增加到非线性时，在我的情况下是从x^1到x^8，我就无法使我的函数拟合数据集。我正在做的练习非常简单，我有12个数据点，因此使用8阶多项式应该能捕捉到每一个点（如果你好奇的话，这是一个高方差的例子，即过拟合数据）。他们展示的解决方案是一条穿过所有数据点的线，正如预期的那样，捕捉到了所有点。我得到的最好结果是使用‘nelder-mead’方法时，它只捕捉到了数据集中的两个点，而其余的最小化函数甚至没有给我接近我所寻找的结果。我不确定哪里出了问题，因为我的成本函数和梯度在线性情况下给出了正确的值，所以我假设它们在工作（Octave的精确答案）。

我将列出Octave和Python中的函数，希望有人能解释我为什么得到不同的答案。或者指出我没有看到的明显错误。

function [J, grad] = linearRegCostFunction(X, y, theta, lambda)%LINEARREGCOSTFUNCTION Compute cost and gradient for regularized linear %regression with multiple variables%   [J, grad] = LINEARREGCOSTFUNCTION(X, y, theta, lambda) computes the %   cost of using theta as the parameter for linear regression to fit the %   data points in X and y. Returns the cost in J and the gradient in gradm = length(y); % number of training examples J = 0;grad = zeros(size(theta));htheta = X * theta;n = size(theta);J = 1 / (2 * m) * sum((htheta - y) .^ 2) + lambda / (2 * m) * sum(theta(2:n) .^ 2);grad = 1 / m * X' * (htheta - y);grad(2:n) = grad(2:n) + lambda / m * theta(2:n); # we leave the bias nice grad = grad(:);end

这是我的代码片段，如果有人想要完整的代码，我也可以提供：

def costFunction(theta, Xcost, y, lmda):    m = len(y)    theta = theta.reshape((len(theta),1))    htheta = np.dot(Xcost,theta) - y     J = 1 / (2 * m) * np.dot(htheta.T,htheta) + lmda / (2 * m) * np.sum(theta[1:,:]**2)    return Jdef gradCostFunc(gradtheta, X, y, lmda):    m = len(y)    gradtheta = gradtheta.reshape((len(gradtheta),1))    hgradtheta = np.dot(X,gradtheta) - y     #gradtheta[0,0] = 0.     grad = (1 / m) * np.dot(X.T, hgradtheta)    #for i in range(1,len(grad)):    grad[1:,0] = grad[1:,0] + (lmda/m) * gradtheta[1:,0]    return grad.reshape((len(grad)))def normalEqn(X, y, lmda):    e = np.eye(X.shape[1])    e[0,0] = 0    theta = np.dot(np.linalg.pinv(np.dot(X.T,X) + lmda * e),np.dot(X.T,y))    return theta def gradientDescent(X, y, theta, alpha, lmda, num_iters):    # calculate gradient descent in an iterative manner    m = len(y)    # J_history tracks the evolution of the cost function     J_history = np.zeros((num_iters,1))    # Calculating the gradients     for i in range(0, num_iters):        grad = np.zeros((len(theta),1))        grad = gradCostFunc(theta, X, y, lmda)        #updating the thetas         theta = theta - alpha * grad         J_history[i] = costFunction(theta, X, y, lmda)    plt.plot(J_history)    plt.show()    return theta def trainLR(initheta, X, y, lmda):    #print theta.shape, X.shape, y.shape, gradtest.shape gradCostFunc    options = {'maxiter': 1000}    res = optimize.minimize(costFunction, initheta, jac=gradCostFunc, method='CG',                            args=(X, y, lmda), options = options)    #res = optimize.minimize(costFunction, theta, method='nelder-mead',                             args=(X,y,lmda), options={'disp': False})    #res = optimize.fmin_bfgs(costFunction, theta, fprime=gradCostFunc, args=(X, y, lmda))    return res.xdef polyFeatures(X, degree):    # map the higher polynomials     out = X     if degree >= 2:        for i in range(2,degree+1):            out = np.column_stack((out,X**i))        return out     else:        return outdef featureNormalize(X):    # Since the values will vary by orders of magnitudes     # It’s important to normalize the various features     mu = np.mean(X, axis=0)    S1 = np.std(X, axis=0)    return mu, S1, (X - mu)/S1

这是这些函数的主要调用：

X, y, Xval, yval, Xtest, ytest = loadData('ex5data1.mat')X_poly = X # to be used in the later on in the program p = 8 X_poly = polyFeatures(X_poly, p)mu, sigma, X_poly = featureNormalize(X_poly)X_poly = padding(X_poly)theta = np.zeros((X_poly.shape[1],1))theta = trainLR(theta, X_poly, y, 0.)#theta = normalEqn(X_poly, y, 0.)#theta = gradientDescent(X_poly, y, theta, 0.1, 0, 1500)

回答：

我的回答可能偏离了重点，因为你的问题是寻求帮助调试你当前的实现。

不过，如果你对在Python中使用现成的优化器感兴趣，可以看看OpenOpt。该库包含了针对各种优化问题的性能合理优化的实现。

我还应该提到，scikit-learn库为Python提供了很好的机器学习工具集。

学技术

不同的Python最小化函数给出不同的值，为什么？

发表回复取消回复

相关文章：

Related Posts

使用LSTM在Python中预测未来值

如何在gensim的word2vec模型中查找双词组的相似性

dask_xgboost.predict 可以工作但无法显示 – 数据必须是一维的

ML Tuning – Cross Validation in Spark

如何在React JS中使用fetch从REST API获取预测

如何分析ML.NET中多类分类预测得分数组？

发表回复 取消回复

发表回复取消回复