How to calculate the accuracy of logistic regression

I am a complete beginner in machine learning and Python programming, and I was asked to code logistic regression from scratch to understand what happens behind the scenes. So far I have written the code for the hypothesis function, the cost function, and gradient descent, and then put together the logistic regression itself. However, the accuracy my code prints out is low (0.69), and it does not change no matter how much I increase the iterations or change the learning rate. My question is: is there a problem with my accuracy code? Any help pointing me in the right direction would be appreciated.

X = data[['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean',
          'smoothness_mean', 'compactness_mean', 'concavity_mean',
          'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
          'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
          'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
          'fractal_dimension_se', 'radius_worst', 'texture_worst',
          'perimeter_worst', 'area_worst', 'smoothness_worst',
          'compactness_worst', 'concavity_worst', 'concave points_worst',
          'symmetry_worst', 'fractal_dimension_worst']]
X = np.array(X)
X = min_max_scaler.fit_transform(X)
Y = data["diagnosis"].map({'M':1,'B':0})
Y = np.array(Y)
X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size=0.25)
X = data["diagnosis"].map(lambda x: float(x))

def Sigmoid(z):
    if z < 0:
        return 1 - 1/(1 + math.exp(z))
    else:
        return 1/(1 + math.exp(-z))

def Hypothesis(theta, x):
    z = 0
    for i in range(len(theta)):
        z += x[i]*theta[i]
    return Sigmoid(z)

def Cost_Function(X,Y,theta,m):
    sumOfErrors = 0
    for i in range(m):
        xi = X[i]
        hi = Hypothesis(theta,xi)
        error = Y[i] * math.log(hi if hi > 0 else 1)
        if Y[i] == 1:
            error = Y[i] * math.log(hi if hi > 0 else 1)
        elif Y[i] == 0:
            error = (1-Y[i]) * math.log(1-hi if 1-hi > 0 else 1)
        sumOfErrors += error
    constant = -1/m
    J = constant * sumOfErrors
    #print ('cost is: ', J )
    return J

def Cost_Function_Derivative(X,Y,theta,j,m,alpha):
    sumErrors = 0
    for i in range(m):
        xi = X[i]
        xij = xi[j]
        hi = Hypothesis(theta,X[i])
        error = (hi - Y[i])*xij
        sumErrors += error
    m = len(Y)
    constant = float(alpha)/float(m)
    J = constant * sumErrors
    return J

def Gradient_Descent(X,Y,theta,m,alpha):
    new_theta = []
    constant = alpha/m
    for j in range(len(theta)):
        CFDerivative = Cost_Function_Derivative(X,Y,theta,j,m,alpha)
        new_theta_value = theta[j] - CFDerivative
        new_theta.append(new_theta_value)
    return new_theta

def Accuracy(theta):
    correct = 0
    length = len(X_test, Hypothesis(X,theta))
    for i in range(length):
        prediction = round(Hypothesis(X[i],theta))
        answer = Y[i]
        if prediction == answer.all():
            correct += 1
    my_accuracy = (correct / length)*100
    print ('LR Accuracy %: ', my_accuracy)

def Logistic_Regression(X,Y,alpha,theta,num_iters):
    theta = np.zeros(X.shape[1])
    m = len(Y)
    for x in range(num_iters):
        new_theta = Gradient_Descent(X,Y,theta,m,alpha)
        theta = new_theta
        if x % 100 == 0:
            Cost_Function(X,Y,theta,m)
            print ('theta: ', theta)
            print ('cost: ', Cost_Function(X,Y,theta,m))
    Accuracy(theta)

initial_theta = [0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]
alpha = 0.0001
iterations = 1000
Logistic_Regression(X,Y,alpha,initial_theta,iterations)

This is using data from the Wisconsin breast cancer dataset (https://www.kaggle.com/uciml/breast-cancer-wisconsin-data), in which I am considering 30 features, although switching to features that are known to correlate did not change my accuracy either.


Answer:

I am not sure how you arrived at a value of 0.0001 for alpha, but I think it is too low. Using your code with the cancer data shows that the cost is decreasing with every iteration, it is just doing so extremely slowly.

When I raised this to 0.5, the cost still decreases, but at a more reasonable rate. After 1000 iterations it reports:

cost:  0.23668000993020666
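If you want to see the effect of alpha for yourself, here is a quick sketch that sweeps a few learning rates using your own loop-based functions on the training split (the alpha values are just examples, and it runs slowly for exactly the reasons discussed below, but it makes the trend visible):

# Sketch: compare the final training cost for a few learning rates,
# reusing the question's Gradient_Descent and Cost_Function as-is.
m = len(Y_train)
for alpha in (0.0001, 0.01, 0.5):
    theta = [0.0] * X_train.shape[1]   # one weight per feature
    for _ in range(1000):
        theta = Gradient_Descent(X_train, Y_train, theta, m, alpha)
    print ('alpha =', alpha, ' final cost =', Cost_Function(X_train, Y_train, theta, m))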

After fixing the Accuracy function, I get 92% accuracy on the test portion of the data.
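The question leaves little doubt about where Accuracy goes wrong: len(X_test, Hypothesis(X,theta)) is not a valid len call, the arguments to Hypothesis are swapped, and the model is scored on X/Y rather than on the held-out split. A minimal sketch of the repaired loop-based version (reusing the question's own Hypothesis and the X_test/Y_test produced by train_test_split):

def Accuracy(theta):
    correct = 0
    length = len(X_test)                              # score the held-out split
    for i in range(length):
        prediction = round(Hypothesis(theta, X_test[i]))  # threshold at 0.5
        if prediction == Y_test[i]:                   # labels are scalar 0/1
            correct += 1
    my_accuracy = (correct / length) * 100
    print ('LR Accuracy %: ', my_accuracy)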

You have Numpy installed, as X = np.array(X) shows. You should really consider using it for your operations; it will be much faster for a task like this. Here is a vectorized version that delivers results immediately instead of making you wait:

import math
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

df = pd.read_csv("cancerdata.csv")
X = df.values[:,2:-1].astype('float64')
X = (X - np.mean(X, axis=0)) / np.std(X, axis=0)

## Add a bias column to the data
X = np.hstack([np.ones((X.shape[0], 1)), X])
X = MinMaxScaler().fit_transform(X)
Y = df["diagnosis"].map({'M':1,'B':0})
Y = np.array(Y)

X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size=0.25)

def Sigmoid(z):
    return 1/(1 + np.exp(-z))

def Hypothesis(theta, x):
    return Sigmoid(x @ theta)

def Cost_Function(X,Y,theta,m):
    hi = Hypothesis(theta, X)
    _y = Y.reshape(-1, 1)
    J = 1/float(m) * np.sum(-_y * np.log(hi) - (1-_y) * np.log(1-hi))
    return J

def Cost_Function_Derivative(X,Y,theta,m,alpha):
    hi = Hypothesis(theta, X)
    _y = Y.reshape(-1, 1)
    J = alpha/float(m) * X.T @ (hi - _y)
    return J

def Gradient_Descent(X,Y,theta,m,alpha):
    new_theta = theta - Cost_Function_Derivative(X,Y,theta,m,alpha)
    return new_theta

def Accuracy(theta):
    correct = 0
    length = len(X_test)
    prediction = (Hypothesis(theta, X_test) > 0.5)
    _y = Y_test.reshape(-1, 1)
    correct = prediction == _y
    my_accuracy = (np.sum(correct) / length)*100
    print ('LR Accuracy %: ', my_accuracy)

def Logistic_Regression(X,Y,alpha,theta,num_iters):
    m = len(Y)
    for x in range(num_iters):
        new_theta = Gradient_Descent(X,Y,theta,m,alpha)
        theta = new_theta
        if x % 100 == 0:
            #print ('theta: ', theta)
            print ('cost: ', Cost_Function(X,Y,theta,m))
    Accuracy(theta)

ep = .012
initial_theta = np.random.rand(X_train.shape[1],1) * 2 * ep - ep
alpha = 0.5
iterations = 2000
Logistic_Regression(X_train,Y_train,alpha,initial_theta,iterations)
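One caveat on the vectorized Sigmoid: your original scalar version branched on the sign of z precisely so that math.exp never overflows, and the plain np.exp(-z) above drops that guard, so you may see overflow warnings for large negative z. If that bothers you, here is a sketch of a numerically stable vectorized replacement (scipy.special.expit does the same job if you would rather not hand-roll it):

def Sigmoid(z):
    # Stable sigmoid: only ever exponentiates non-positive values,
    # mirroring the branch in the original scalar version.
    out = np.empty_like(z, dtype='float64')
    pos = z >= 0
    out[pos] = 1 / (1 + np.exp(-z[pos]))
    ez = np.exp(z[~pos])                  # safe: z[~pos] < 0
    out[~pos] = ez / (1 + ez)
    return out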

I think I may be using a different version of scikit, because I had to change the MinMaxScaler line to get it working. The upshot is that I can now do 10K iterations in the blink of an eye, and applying the model to the test set gives about 97% accuracy.
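As a final sanity check (this comparison is an extra step, not part of the solution above), scikit-learn's built-in LogisticRegression should land in the same ballpark on the same split; if the hand-rolled model scores far below it, something is still off:

from sklearn.linear_model import LogisticRegression

# Fit the library implementation on the identical train/test split.
clf = LogisticRegression(max_iter=10000)
clf.fit(X_train, Y_train)
print ('sklearn accuracy %: ', clf.score(X_test, Y_test) * 100)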
