scikit-learn classifier with mixed-type features returns 0% accuracy on test data

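I am new to machine learning and Python. I want to use the DecisionTreeClassifier from sklearn. Some of my features are numerical and some are categorical, so I need to transform them, because DecisionTreeClassifier only accepts numerical features as input. To do this, I use a ColumnTransformer and pipelines. The idea is as follows:

  1. The categorical and numerical features are transformed in separate pipelines
  2. The two results are combined to form the input for the classifier

However, the accuracy on my test data is always 0%, while the accuracy on my training data is about 85%. In addition, calling cross_val_score() returns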

ValueError: Found unknown categories ['Holand-Netherlands'] in column 7 during transform

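This is strange, because I trained full_pipeline on this very data. Using a different classifier leads to the same result, which makes me think the problem lies in the transformation. Any help is much appreciated!

Here is my code: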

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

names = ["age", "workclass", "final-weight", "education", "education-num",
         "martial-status", "occupation", "relationship", "race", "sex",
         "capital-gain", "capial-loss", "hours-per-week", "native-country",
         "agrossincome"]
categorical_features = ["workclass", "education", "martial-status", "occupation",
                        "relationship", "race", "sex", "native-country"]
numerical_features = ["age", "final-weight", "education-num", "capital-gain",
                      "capial-loss", "hours-per-week"]
features = np.concatenate([categorical_features, numerical_features])

# create pandas dataframe for adult dataset
adult_train = pd.read_csv(
    filepath_or_buffer="https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
    delimiter=',', index_col=False, skipinitialspace=True,
    header=None, names=names)
adult_test = pd.read_csv(
    filepath_or_buffer="https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test",
    delimiter=',', index_col=False, skipinitialspace=True,
    header=None, names=names)

# the first row of adult.test is not a data record
adult_test.drop(0, inplace=True)
adult_test.reset_index(inplace=True)

# mark missing values, which the files encode as "?"
adult_train.replace(to_replace="?", value=np.nan, inplace=True)
adult_test.replace(to_replace="?", value=np.nan, inplace=True)

# split data into features and targets
x_train = adult_train[features]
y_train = adult_train.agrossincome
x_test = adult_test[features]
y_test = adult_test.agrossincome

# create pipeline for preprocessing + classifier
categorical_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('encoding', OrdinalEncoder())])
numerical_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('std_scaler', StandardScaler(with_mean=False))])
preprocessing = ColumnTransformer(transformers=[
    ('categorical_pipeline', categorical_pipeline, categorical_features),
    ('numerical_pipeline', numerical_pipeline, numerical_features)])
full_pipeline = Pipeline(steps=[
    ('preprocessing', preprocessing),
    ('model', DecisionTreeClassifier(random_state=0, max_depth=5))])

full_pipeline.fit(x_train, y_train)
print(full_pipeline.score(x_test, y_test))
#print(cross_val_score(full_pipeline, x_train, y_train, cv=3).mean())

Answer:

The error comes from y_test: its labels end with a trailing '.' (for example '<=50K.' instead of '<=50K'), so they never match the labels the model learned from y_train.

Removing the trailing '.' should fix the problem.
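A minimal sketch of that cleanup, assuming the labels are held in pandas Series. The two small Series below are hypothetical stand-ins for adult_train.agrossincome and adult_test.agrossincome, used so the snippet runs without downloading the data:

```python
import pandas as pd

# Hypothetical stand-ins for the real label columns
# (adult_train.agrossincome and adult_test.agrossincome in the question).
y_train = pd.Series(["<=50K", ">50K", "<=50K"])
y_test = pd.Series(["<=50K.", ">50K.", "<=50K."])

# Before cleaning, no test label occurs among the training labels,
# so every prediction is counted as wrong.
overlap_before = set(y_test) & set(y_train)

# Strip any trailing periods from the test labels.
y_test_clean = y_test.str.rstrip(".")

# After cleaning, the two label sets agree.
overlap_after = set(y_test_clean) & set(y_train)

print(overlap_before)
print(sorted(overlap_after))
```

In the question's code, the same call would presumably be applied as y_test = adult_test.agrossincome.str.rstrip(".") before calling full_pipeline.score(x_test, y_test).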
