I'm a complete beginner in machine learning and Python programming, and I've been asked to code logistic regression from scratch so that I understand what's happening behind the scenes. So far I've coded the hypothesis function, the cost function, and gradient descent, and then put the logistic regression together. However, when printing out the accuracy I get a low value (0.69) that doesn't change no matter how much I increase the iterations or change the learning rate. My question is: is there something wrong with my accuracy code? Any help pointing me in the right direction would be appreciated.
```python
X = data[['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean',
          'compactness_mean', 'concavity_mean', 'concave points_mean', 'symmetry_mean',
          'fractal_dimension_mean', 'radius_se', 'texture_se', 'perimeter_se', 'area_se',
          'smoothness_se', 'compactness_se', 'concavity_se', 'concave points_se',
          'symmetry_se', 'fractal_dimension_se', 'radius_worst', 'texture_worst',
          'perimeter_worst', 'area_worst', 'smoothness_worst', 'compactness_worst',
          'concavity_worst', 'concave points_worst', 'symmetry_worst', 'fractal_dimension_worst']]
X = np.array(X)
X = min_max_scaler.fit_transform(X)
Y = data["diagnosis"].map({'M':1,'B':0})
Y = np.array(Y)
X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size=0.25)

X = data["diagnosis"].map(lambda x: float(x))

def Sigmoid(z):
    if z < 0:
        return 1 - 1/(1 + math.exp(z))
    else:
        return 1/(1 + math.exp(-z))

def Hypothesis(theta, x):
    z = 0
    for i in range(len(theta)):
        z += x[i]*theta[i]
    return Sigmoid(z)

def Cost_Function(X,Y,theta,m):
    sumOfErrors = 0
    for i in range(m):
        xi = X[i]
        hi = Hypothesis(theta,xi)
        error = Y[i] * math.log(hi if hi > 0 else 1)
        if Y[i] == 1:
            error = Y[i] * math.log(hi if hi > 0 else 1)
        elif Y[i] == 0:
            error = (1-Y[i]) * math.log(1-hi if 1-hi > 0 else 1)
        sumOfErrors += error
    constant = -1/m
    J = constant * sumOfErrors
    #print ('cost is: ', J )
    return J

def Cost_Function_Derivative(X,Y,theta,j,m,alpha):
    sumErrors = 0
    for i in range(m):
        xi = X[i]
        xij = xi[j]
        hi = Hypothesis(theta,X[i])
        error = (hi - Y[i])*xij
        sumErrors += error
    m = len(Y)
    constant = float(alpha)/float(m)
    J = constant * sumErrors
    return J

def Gradient_Descent(X,Y,theta,m,alpha):
    new_theta = []
    constant = alpha/m
    for j in range(len(theta)):
        CFDerivative = Cost_Function_Derivative(X,Y,theta,j,m,alpha)
        new_theta_value = theta[j] - CFDerivative
        new_theta.append(new_theta_value)
    return new_theta

def Accuracy(theta):
    correct = 0
    length = len(X_test, Hypothesis(X,theta))
    for i in range(length):
        prediction = round(Hypothesis(X[i],theta))
        answer = Y[i]
        if prediction == answer.all():
            correct += 1
    my_accuracy = (correct / length)*100
    print ('LR Accuracy %: ', my_accuracy)

def Logistic_Regression(X,Y,alpha,theta,num_iters):
    theta = np.zeros(X.shape[1])
    m = len(Y)
    for x in range(num_iters):
        new_theta = Gradient_Descent(X,Y,theta,m,alpha)
        theta = new_theta
        if x % 100 == 0:
            Cost_Function(X,Y,theta,m)
            print ('theta: ', theta)
            print ('cost: ', Cost_Function(X,Y,theta,m))
    Accuracy(theta)

initial_theta = [0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]
alpha = 0.0001
iterations = 1000
Logistic_Regression(X,Y,alpha,initial_theta,iterations)
```
This is using the Wisconsin breast cancer dataset (https://www.kaggle.com/uciml/breast-cancer-wisconsin-data), in which I'm considering 30 features, although switching to features that are known to correlate didn't change my accuracy either.
Answer:
I'm not sure how you arrived at a value of `0.0001` for `alpha`, but I think it's too low. Using your code with the cancer data shows the cost decreasing on every iteration; it's just doing so very slowly.
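To see why a value that small stalls progress, it helps to write out the update that your `Gradient_Descent` / `Cost_Function_Derivative` pair performs for each parameter:

$$\theta_j := \theta_j - \frac{\alpha}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$$

With the features min-max scaled into [0, 1], the averaged gradient term is already modest, and multiplying it by `alpha = 0.0001` shrinks each step to a tiny nudge, so the cost keeps falling but needs an enormous number of iterations to get anywhere.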
When I bumped it up to 0.5, the cost still decreases, but at a more reasonable rate. After 1000 iterations it reports:
cost: 0.23668000993020666
After fixing the `Accuracy` function, I get 92% accuracy on the test portion of the data.
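For reference, a minimal fix along those lines for the loop-based `Accuracy` could look like the sketch below. I'm assuming the intent is to score on the held-out `X_test` / `Y_test`; the main problems in the original are the `len(X_test, ...)` call, the swapped argument order passed to `Hypothesis`, and the `answer.all()` comparison.

```python
# A minimal sketch of a fixed loop-based Accuracy (assumes evaluation happens
# on the held-out X_test / Y_test produced by train_test_split above).
def Accuracy(theta):
    correct = 0
    length = len(X_test)                                   # just the test set size
    for i in range(length):
        prediction = round(Hypothesis(theta, X_test[i]))   # (theta, x) argument order
        if prediction == Y_test[i]:                        # plain scalar comparison
            correct += 1
    my_accuracy = (correct / length) * 100
    print('LR Accuracy %: ', my_accuracy)
```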
You have Numpy installed, as shown by `X = np.array(X)`. You should really consider using it for your operations; it will be much faster for a task like this. Here is a vectorized version that delivers results immediately rather than leaving you waiting:
```python
import math
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

df = pd.read_csv("cancerdata.csv")
X = df.values[:,2:-1].astype('float64')
X = (X - np.mean(X, axis=0)) / np.std(X, axis=0)

## Add a bias column to the data
X = np.hstack([np.ones((X.shape[0], 1)), X])
X = MinMaxScaler().fit_transform(X)
Y = df["diagnosis"].map({'M':1,'B':0})
Y = np.array(Y)

X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size=0.25)

def Sigmoid(z):
    return 1/(1 + np.exp(-z))

def Hypothesis(theta, x):
    return Sigmoid(x @ theta)

def Cost_Function(X,Y,theta,m):
    hi = Hypothesis(theta, X)
    _y = Y.reshape(-1, 1)
    J = 1/float(m) * np.sum(-_y * np.log(hi) - (1-_y) * np.log(1-hi))
    return J

def Cost_Function_Derivative(X,Y,theta,m,alpha):
    hi = Hypothesis(theta, X)
    _y = Y.reshape(-1, 1)
    J = alpha/float(m) * X.T @ (hi - _y)
    return J

def Gradient_Descent(X,Y,theta,m,alpha):
    new_theta = theta - Cost_Function_Derivative(X,Y,theta,m,alpha)
    return new_theta

def Accuracy(theta):
    correct = 0
    length = len(X_test)
    prediction = (Hypothesis(theta, X_test) > 0.5)
    _y = Y_test.reshape(-1, 1)
    correct = prediction == _y
    my_accuracy = (np.sum(correct) / length)*100
    print ('LR Accuracy %: ', my_accuracy)

def Logistic_Regression(X,Y,alpha,theta,num_iters):
    m = len(Y)
    for x in range(num_iters):
        new_theta = Gradient_Descent(X,Y,theta,m,alpha)
        theta = new_theta
        if x % 100 == 0:
            #print ('theta: ', theta)
            print ('cost: ', Cost_Function(X,Y,theta,m))
    Accuracy(theta)

ep = .012
initial_theta = np.random.rand(X_train.shape[1],1) * 2 * ep - ep
alpha = 0.5
iterations = 2000
Logistic_Regression(X_train,Y_train,alpha,initial_theta,iterations)
```
I think I may be using a different version of scikit, because I had to change the `MinMaxScaler` line to get it to work. The upshot is that I can run 10K iterations in the blink of an eye, and applying the model to the test set gives about 97% accuracy.
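If you want an independent check on numbers in that range, one option is to compare against scikit-learn's built-in `LogisticRegression` on the same split. This is only a sanity-check sketch, not part of the hand-rolled solution, and its solver and regularization differ, so expect a similar but not identical score:

```python
# Sanity-check sketch (assumes X_train/X_test/Y_train/Y_test from the split above):
# compare the from-scratch model against scikit-learn's LogisticRegression.
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(max_iter=10000)   # plenty of iterations for convergence
clf.fit(X_train, Y_train)
print('sklearn LR accuracy %: ', clf.score(X_test, Y_test) * 100)
```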