Building a machine learning classifier with imbalanced data

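I have a dataset with 1400 observations and 19 columns. The target variable takes the values 1 (the value I am most interested in) and 0. The class distribution is imbalanced (70:30).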

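With the code below I get strange values (all 1s). I cannot tell whether this is caused by overfitting, by the imbalanced data, or by feature selection (all values are numeric/boolean, so I used the Pearson correlation coefficient). I suspect the steps I followed are wrong.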

import math

import numpy as np
from sklearn.metrics import classification_report, f1_score
from sklearn.tree import DecisionTreeClassifier

y = df['Label']
X = df.drop('Label', axis=1)

def create_cv(X, y):
    # Work on NumPy arrays so boolean masks can index X and y alike
    if type(X) != np.ndarray:
        X = X.values
        y = y.values

    test_size = 1 / 5
    proportion_of_true = y[y == 1].shape[0] / y.shape[0]
    num_test_samples = math.ceil(y.shape[0] * test_size)
    num_test_true_labels = math.floor(num_test_samples * proportion_of_true)
    num_test_false_labels = math.floor(num_test_samples - num_test_true_labels)

    # The first rows of each class go to the test set, the rest to training
    # (no shuffling, so the split depends on the row order of the data)
    y_test = np.concatenate([y[y == 0][:num_test_false_labels], y[y == 1][:num_test_true_labels]])
    y_train = np.concatenate([y[y == 0][num_test_false_labels:], y[y == 1][num_test_true_labels:]])

    X_test = np.concatenate([X[y == 0][:num_test_false_labels], X[y == 1][:num_test_true_labels]], axis=0)
    X_train = np.concatenate([X[y == 0][num_test_false_labels:], X[y == 1][num_test_true_labels:]], axis=0)
    return X_train, X_test, y_train, y_test

X_train, X_test, y_train, y_test = create_cv(X, y)
X_train, X_crossv, y_train, y_crossv = create_cv(X_train, y_train)

tree = DecisionTreeClassifier(max_depth=5)
tree.fit(X_train, y_train)

y_predict_test = tree.predict(X_test)
print(classification_report(y_test, y_predict_test))
f1_score(y_test, y_predict_test)

Output:

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        24
           1       1.00      1.00      1.00        70

    accuracy                           1.00        94
   macro avg       1.00      1.00      1.00        94
weighted avg       1.00      1.00      1.00        94

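Has anyone run into a similar problem when building a classifier on imbalanced data with cross-validation and/or undersampling? I am happy to share the whole dataset if you want to reproduce the output. I would appreciate a clear answer that shows the steps and points out where I went wrong.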
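Since the question mentions Pearson correlation, one quick check when every score is exactly 1.00 is whether a single column almost reproduces the label by itself. The following is only a sketch on synthetic data: the `leaky` and `noise` columns and the 0.95 threshold are assumptions made for illustration, not properties of the dataset described above.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data: 'leaky' is a hypothetical column
# that copies the label, 'noise' is unrelated to it.
rng = np.random.default_rng(0)
label = rng.choice([0, 1], size=1400, p=[0.3, 0.7])
df = pd.DataFrame({
    "noise": rng.normal(size=1400),
    "leaky": label.astype(float),
    "Label": label,
})

# Absolute Pearson correlation of every feature with the target.
corr_with_label = df.corr()["Label"].drop("Label").abs()

# Features that nearly determine the label on their own are leakage suspects.
# The 0.95 cut-off is an arbitrary choice for this sketch.
suspects = corr_with_label[corr_with_label > 0.95].index.tolist()

print(corr_with_label.sort_values(ascending=False))
print("Possible leakage:", suspects)
```

If a column shows up here on the real data, it is worth checking whether it would actually be available at prediction time.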

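I know there are methods to reduce overfitting and to deal with class imbalance, such as random sampling (over/under), SMOTE, and cross-validation. My idea is to: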

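Split the data into train/test sets, taking the imbalance into account
Perform cross-validation on the training set
Apply undersampling only on a test fold
After a model has been chosen with the help of cross-validation, undersample the training set and train the classifier
Estimate the performance (F1 score) on the untouched test set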

This is also outlined in this question: CV and undersampling on a test fold.

I think the steps above should make sense, but I would be glad to receive any feedback on them.
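The five steps listed above can be sketched roughly as follows. Two things in this sketch are assumptions rather than part of the question: the data is synthetic (`make_classification` with a roughly 30:70 class ratio), and undersampling is applied to the training portion of each fold while the validation fold keeps its original distribution, which is a common variant and not necessarily what the list intends. The `undersample` helper is hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.tree import DecisionTreeClassifier


def undersample(X, y, rng):
    """Randomly drop majority-class rows until both classes have equal size."""
    idx0, idx1 = np.flatnonzero(y == 0), np.flatnonzero(y == 1)
    n = min(len(idx0), len(idx1))
    keep = np.concatenate([rng.choice(idx0, n, replace=False),
                           rng.choice(idx1, n, replace=False)])
    rng.shuffle(keep)
    return X[keep], y[keep]


# Synthetic placeholder data: about 30% zeros and 70% ones.
X, y = make_classification(n_samples=1400, n_features=18,
                           weights=[0.3, 0.7], random_state=0)
rng = np.random.default_rng(0)

# Step 1: stratified hold-out split; the test set is set aside until the end.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Steps 2-3: stratified CV on the training set, resampling inside each fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = []
for fit_idx, val_idx in cv.split(X_train, y_train):
    X_fit, y_fit = undersample(X_train[fit_idx], y_train[fit_idx], rng)
    model = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_fit, y_fit)
    cv_scores.append(f1_score(y_train[val_idx], model.predict(X_train[val_idx])))

# Step 4: undersample the whole training set and refit the chosen model.
X_bal, y_bal = undersample(X_train, y_train, rng)
final_model = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_bal, y_bal)

# Step 5: estimate performance once on the untouched test set.
test_f1 = f1_score(y_test, final_model.predict(X_test))
print("CV F1 per fold:", np.round(cv_scores, 3))
print("Test F1:", round(test_f1, 3))
```

Because the split is shuffled and stratified, the result does not depend on the row order of the original data.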

Answer:

When your data is imbalanced, you have to stratify. The usual approach is to oversample the class with fewer samples.
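As a minimal sketch of that idea, assuming scikit-learn is available: `train_test_split` with `stratify=y` keeps the class ratio in both parts, and `sklearn.utils.resample` draws minority rows with replacement until the classes match. The arrays are synthetic, and plain random oversampling is used here rather than SMOTE.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic data: 300 samples of class 0 (minority), 700 of class 1.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = np.array([0] * 300 + [1] * 700)

# Stratified split: both parts keep the 30:70 ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Oversample the minority class in the training set only,
# so that no duplicated row can end up in the test set.
minority = y_train == 0
n_majority = int((~minority).sum())
X_min_up, y_min_up = resample(
    X_train[minority], y_train[minority],
    replace=True, n_samples=n_majority, random_state=0)

X_bal = np.vstack([X_train[~minority], X_min_up])
y_bal = np.concatenate([y_train[~minority], y_min_up])

print("Test class counts:", np.bincount(y_test))
print("Balanced training class counts:", np.bincount(y_bal))
```

Oversampling after the split matters: resampling first would let copies of the same row appear on both sides and inflate the test score.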

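Another option is to train your algorithm on less data. If you have a good dataset, that should not be a problem. In that case, you first take the samples from the less represented class, then use the size of that set to work out how many samples to take from the other class.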

The following code can help you split the dataset that way:

import random

import pandas as pd


def split_dataset(dataset: pd.DataFrame, train_share=0.8):
    """Splits the dataset into training and test sets"""
    all_idx = range(len(dataset))
    train_count = int(len(all_idx) * train_share)
    train_idx = random.sample(all_idx, train_count)
    test_idx = list(set(all_idx).difference(set(train_idx)))
    train = dataset.iloc[train_idx]
    test = dataset.iloc[test_idx]
    return train, test


def split_dataset_stratified(dataset, target_attr, positive_class, train_share=0.8):
    """Splits the dataset as in `split_dataset` but with stratification"""
    data_pos = dataset[dataset[target_attr] == positive_class]
    data_neg = dataset[dataset[target_attr] != positive_class]
    if len(data_pos) < len(data_neg):
        # Positive class is the smaller one: split it first, then take
        # the same number of training rows from the negative class
        train_pos, test_pos = split_dataset(data_pos, train_share)
        train_neg, test_neg = split_dataset(data_neg, len(train_pos) / len(data_neg))
        # set.difference makes the test set larger
        test_neg = test_neg.iloc[0:len(test_pos)]
    else:
        train_neg, test_neg = split_dataset(data_neg, train_share)
        train_pos, test_pos = split_dataset(data_pos, len(train_neg) / len(data_pos))
        # set.difference makes the test set larger
        test_pos = test_pos.iloc[0:len(test_neg)]
    # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
    train = pd.concat([train_pos, train_neg]).sample(frac=1).reset_index(drop=True)
    test = pd.concat([test_pos, test_neg]).sample(frac=1).reset_index(drop=True)
    return train, test

Usage:

train_ds, test_ds = split_dataset_stratified(data, target_attr, positive_class)

Now you can cross-validate on train_ds and evaluate your model on test_ds.
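That last step might look like the sketch below. To keep it self-contained, `train_ds` and `test_ds` are built here from synthetic data with scikit-learn's `train_test_split` as a stand-in for the function above; the `Label` column name, the feature names, and the decision tree settings are assumptions carried over from the question.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in for train_ds / test_ds (synthetic data, hypothetical column names).
X, y = make_classification(n_samples=600, n_features=6, random_state=0)
data = pd.DataFrame(X, columns=[f"f{i}" for i in range(6)])
data["Label"] = y
train_ds, test_ds = train_test_split(
    data, test_size=0.2, stratify=data["Label"], random_state=0)

X_train, y_train = train_ds.drop(columns="Label"), train_ds["Label"]
X_test, y_test = test_ds.drop(columns="Label"), test_ds["Label"]

# Cross-validate on the training set only, scoring each fold with F1.
model = DecisionTreeClassifier(max_depth=5, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_f1 = cross_val_score(model, X_train, y_train, cv=cv, scoring="f1")

# Refit on the full training set and evaluate once on the held-out test set.
model.fit(X_train, y_train)
test_f1 = f1_score(y_test, model.predict(X_test))

print("CV F1 per fold:", cv_f1.round(3))
print("Test F1:", round(test_f1, 3))
```

Comparing the cross-validated scores with the test score is a simple way to notice overfitting: a large gap between the two suggests the model does not generalize.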
