Extremely low accuracy with scikit-learn classifiers (naive Bayes, decision tree)

The dataset I am using is the adult income (census) data; the documentation says the accuracy should be around 84%. Unfortunately, my program's accuracy is only 25%.

To preprocess the data, I do the following:

1. Load the .txt data file and convert it to .csv format
2. Remove rows with missing values
3. Extract the class values <=50K and >50K and convert them to 0 and 1 respectively
4. For each attribute, map every string value of that attribute to an integer, e.g. att1 {'cs': 0, 'cs2': 1}, att2 {'usa': 0, 'greece': 1}, and so on
5. Run naive Bayes on the new all-integer dataset
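Step 4 above (mapping each attribute's string values to integers) can also be done with scikit-learn's `LabelEncoder` instead of a hand-written dictionary. A minimal sketch on a made-up column (the values are illustrative, not taken from the actual data files):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# a hypothetical string-valued column, as in step 4
workclass = np.array(['Private', 'State-gov', 'Private', 'Self-emp-inc'])

enc = LabelEncoder()
codes = enc.fit_transform(workclass)  # assigns each unique string an integer code

print(codes)         # integer code per row
print(enc.classes_)  # the unique strings, in the order their codes were assigned
```

Note that `LabelEncoder` sorts the unique values before assigning codes, so the mapping is deterministic across runs, unlike an ad-hoc dictionary built in insertion order.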

Python code:

import load_csv as load  # my module with the functions for steps 1-5 above
import numpy as np
my_data = np.genfromtxt('out.csv', dtype = dt, delimiter = ',', skip_header = 1)  # dt is the structured dtype (defined elsewhere)
data = np.array(load.remove_missing_values(my_data))                     # drops rows with missing values
features_train = np.array(load.remove_field_num(data, len(data[0]) - 1)) # drops the class column at the end of each row
label_train = np.array(load.create_labels(data))
features_train = np.array(load.convert_to_int(features_train))
my_data = np.genfromtxt('test.csv', dtype = dt, delimiter = ',', skip_header = 1)
data = np.array(load.remove_missing_values(my_data))
features_test = np.array(load.remove_field_num(data, len(data[0]) - 1))
label_test = np.array(load.create_labels(data))                          # extracts the labels from the .csv data file
features_test = np.array(load.convert_to_int(features_test))             # converts strings to integers (each unique string of each attribute gets a unique integer)
from sklearn import tree
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
clf = tree.DecisionTreeClassifier()
clf.fit(features_train, label_train)
predict = clf.predict(features_test)
score = accuracy_score(label_test, predict)  # low accuracy score

The load_csv module:

import numpy as np
attributes = {  
    'Private':0, 'Self-emp-not-inc':1, 'Self-emp-inc':2, 'Federal-gov':3, 'Local-gov':4, 'State-gov':5, 'Without-pay':6, 'Never-worked':7,
    'Bachelors':0, 'Some-college':1, '11th':2, 'HS-grad':3, 'Prof-school':4, 'Assoc-acdm':5, 'Assoc-voc':6, '9th':7, '7th-8th':8, '12th':9, 'Masters':10, '1st-4th':11, '10th':12,
    'Doctorate':13, '5th-6th':14, 'Preschool':15,
    'Married-civ-spouse':0, 'Divorced':1, 'Never-married':2, 'Separated':3, 'Widowed':4, 'Married-spouse-absent':5, 'Married-AF-spouse':6,
    'Tech-support':0, 'Craft-repair':1, 'Other-service':2, 'Sales':3, 'Exec-managerial':4, 'Prof-specialty':5, 'Handlers-cleaners':6, 'Machine-op-inspct':7, 'Adm-clerical':8,
    'Farming-fishing':9, 'Transport-moving':10, 'Priv-house-serv':11, 'Protective-serv':12, 'Armed-Forces':13,
    'Wife':0, 'Own-child':1, 'Husband':2, 'Not-in-family':3, 'Other-relative':4, 'Unmarried':5,
    'White':0, 'Asian-Pac-Islander':1, 'Amer-Indian-Eskimo':2, 'Other':3, 'Black':4,
    'Female':0, 'Male':1,
    'United-States':0, 'Cambodia':1, 'England':2, 'Puerto-Rico':3, 'Canada':4, 'Germany':5, 'Outlying-US(Guam-USVI-etc)':6, 'India':7, 'Japan':8, 'Greece':9, 'South':10, 'China':11,
    'Cuba':12, 'Iran':13, 'Honduras':14, 'Philippines':15, 'Italy':16, 'Poland':17, 'Jamaica':18, 'Vietnam':19, 'Mexico':20, 'Portugal':21, 'Ireland':22, 'France':23,
    'Dominican-Republic':24, 'Laos':25, 'Ecuador':26, 'Taiwan':27, 'Haiti':28, 'Columbia':29, 'Hungary':30, 'Guatemala':31, 'Nicaragua':32, 'Scotland':33, 'Thailand':34, 'Yugoslavia':35,
    'El-Salvador':36, 'Trinadad&Tobago':37, 'Peru':38, 'Hong':39, 'Holand-Netherlands':40
}
def remove_field_num(a, i):                                                                       # removes the field at index i from a structured array
    names = list(a.dtype.names)
    new_names = names[:i] + names[i + 1:]
    b = a[new_names]
    return b
def remove_missing_values(data):
    temp = []
    for i in range(len(data)):
        for j in range(len(data[i])):
            if data[i][j] == '?':                                                                 # skip any row containing the missing-value marker '?'
                break
            if j == (len(data[i]) - 1) and len(data[i]) == 15:
                temp.append(data[i])                                                              # keep rows that contain no '?'
    return temp
def create_labels(data):
    temp = []
    for i in range(len(data)):                                                                    # iterate over the rows
        j = len(data[i]) - 1                                                                      # the label is the last field
        if data[i][j] == '<=50K':
            temp.append(0)
        else:
            temp.append(1)
    return temp
def convert_to_int(data):
    my_lst = []
    for i in range(len(data)):
        lst = []
        for j in range(len(data[i])):
            key = data[i][j]
            if j in (1, 3, 5, 6, 7, 8, 9, 13, 14):                                                # columns holding string attributes
                lst.append(int(attributes[key]))
            else:
                lst.append(int(key))
        my_lst.append(lst)                                                                        # append once per row, after the row is complete
    temp = np.array(my_lst)
    return temp

I tried both the decision tree and naive Bayes, but the accuracy is very low. Any suggestions? What am I missing?


Answer:

I think the problem is in the preprocessing. It is better to encode the categorical variables as one-hot vectors (vectors containing only zeros and a single one, whose position corresponds to the category) rather than as raw integers. Scikit-learn's DictVectorizer can do this for you, and the pandas library makes the categorical handling more convenient.

The following shows how easily this can be done with the pandas library; it plays well with scikit-learn. It reaches an accuracy of about 81.6% with 20% of the data held out as a test set.

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
import numpy as np
import pandas as pd
# read the data into a pandas DataFrame
df = pd.read_csv('adult.data.csv')
# column names
cols = np.array(['age', 'workclass', 'fnlwgt', 'education', 'education-num',
                 'marital-status', 'occupation', 'relationship', 'race', 'sex',
                 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
                 'target'])
# numeric columns
numeric_cols = ['age', 'fnlwgt', 'education-num',
                'capital-gain', 'capital-loss', 'hours-per-week']
# assign names to the DataFrame columns
df.columns = cols
# replace the target values <=50K and >50K with 0 and 1
df1 = df.copy()
df1.loc[df1['target'] == ' <=50K', 'target'] = 0
df1.loc[df1['target'] == ' >50K', 'target'] = 1
# split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    df1.drop('target', axis=1), df1['target'], test_size=0.2)
# numeric attributes
x_num_train = X_train[numeric_cols].to_numpy()
x_num_test = X_test[numeric_cols].to_numpy()
# scale to [0, 1]
max_train = np.amax(x_num_train, 0)
max_test = np.amax(x_num_test, 0)    # not actually needed
x_num_train = x_num_train / max_train
x_num_test = x_num_test / max_train  # scale the test set with max_train
# label / target attribute
y_train = y_train.astype(int)
y_test = y_test.astype(int)
# categorical attributes
cat_train = X_train.drop(numeric_cols, axis=1)
cat_test = X_test.drop(numeric_cols, axis=1)
cat_train.fillna('NA', inplace=True)
cat_test.fillna('NA', inplace=True)
x_cat_train = cat_train.to_dict(orient='records')
x_cat_test = cat_test.to_dict(orient='records')
# vectorize (one-hot encode)
vectorizer = DictVectorizer(sparse=False)
vec_x_cat_train = vectorizer.fit_transform(x_cat_train)
vec_x_cat_test = vectorizer.transform(x_cat_test)
# build the feature vectors
x_train = np.hstack((x_num_train, vec_x_cat_train))
x_test = np.hstack((x_num_test, vec_x_cat_test))
clf = LogisticRegression().fit(x_train, y_train.values)
pred = clf.predict(x_test)
print(classification_report(y_test.values, pred, digits=4))
print(accuracy_score(y_test.values, pred))
clf = DecisionTreeClassifier().fit(x_train, y_train)
pred = clf.predict(x_test)
print(classification_report(y_test.values, pred, digits=4))
print(accuracy_score(y_test.values, pred))
clf = GaussianNB().fit(x_train, y_train)
pred = clf.predict(x_test)
print(classification_report(y_test.values, pred, digits=4))
print(accuracy_score(y_test.values, pred))
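As a side note, the same one-hot encoding can also be produced directly in pandas with `get_dummies`; a minimal sketch on a made-up frame (column names are hypothetical):

```python
import pandas as pd

# a tiny hypothetical frame mixing a numeric and a categorical column
df = pd.DataFrame({'age': [39, 50],
                   'workclass': ['Private', 'State-gov']})

# expand only the categorical column into 0/1 indicator columns
encoded = pd.get_dummies(df, columns=['workclass'])
print(encoded.columns.tolist())
```

This avoids the transpose-to-dict round trip, though DictVectorizer has the advantage that its `transform` applies the training-time column layout to new data.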
