Preprocessing an image dataset, including splitting it into training and test sets

I am learning to do multi-class image classification with deep learning models, using Keras and TensorFlow for the classification task.

The image dataset (10,000 images) is stored as an array, and the labels live in a CSV file containing the image names and the gold labels for the 10 classes. I use the following code to load the images and labels:

    import pandas as pd
    import numpy as np

    path = '/content/image_data/'
    all_images = np.load(path + 'all_images.npy')
    crowd_annotations = pd.read_csv(path + 'crowd_annotationsGold.csv', encoding='utf-8')
    crowd_annotations = crowd_annotations['label'].to_numpy()
    print(all_images)
    print(crowd_annotations)

Printing produces the following output (abbreviated): a 4-D array of RGB pixel values scaled to [0, 1], followed by the label array:

    [[[[0.23137255 0.24313725 0.24705882]
       [0.16862745 0.18039216 0.17647059]
       ...
       [0.48235294 0.36078431 0.28235300]]]
      ...
     [[[0.47843137 0.46666667 0.44705882]
       ...
       [0.63921569 0.63921569 0.63137255]]]]
    [0 0 0 ... 9 9 9]

I want to split the data into a training set and a test set.

After splitting, I would like one training-data array and one test-data array, saved separately. Likewise, the training labels and the test labels (as lists or CSV files) should each be saved separately. Since the data is in sorted order, it probably needs to be shuffled before being split into training and test sets.

Afterwards, I want to use them to train a model and then evaluate it. The test set should be 20% of the data.


Answer:

This can be done with StratifiedShuffleSplit, which returns train and test indices in a stratified fashion (*this also ensures the classes are well distributed/represented in both sets*).

Once you have the indices, you simply slice your data and save the pieces however you like.


Here is a small example to illustrate:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import StratifiedShuffleSplit

    SEED = 2020  # for reproducibility due to the shuffling

    # create some random classification data - make it small for printing out
    X, Y = make_classification(n_samples=20, n_features=3, n_informative=3,
                               n_redundant=0, n_repeated=0, n_clusters_per_class=1,
                               n_classes=3, random_state=SEED)
    print("X Original: \n{}\n".format(X))
    print("Y Original: \n{}\n".format(Y))

    # perform stratified shuffle split. Note the SEED usage for shuffling.
    sss = StratifiedShuffleSplit(n_splits=1, test_size=0.3, random_state=SEED)
    train_index, test_index = next(sss.split(X, Y))
    X_train, X_test = X[train_index], X[test_index]
    Y_train, Y_test = Y[train_index], Y[test_index]

    print("X_Train: \n{}\n".format(X_train))
    print("Y_Train: \n{}\n".format(Y_train))
    print("X_Test: \n{}\n".format(X_test))
    print("Y_Test: \n{}\n".format(Y_test))

    # your code for saving X_train and X_test in separate NPY files, goes here
    # your code for saving Y_train and Y_test in separate CSV files, goes here

Output

    X Original: 
    [[-0.69590064 -0.67561329  0.62524618]
     [-1.09492175  1.27630932  2.15598887]
     [-0.51743065  0.63402055  2.12912755]
     [-1.18819319 -0.42454412  1.49949316]
     [-2.09612492  0.89610929 -0.34134785]
     [ 1.06615086 -2.74141467 -0.26813435]
     [-0.88205757  0.84812284 -0.65742989]
     [-0.95747896 -1.70466278  0.69822828]
     [-0.15885567 -0.15289292 -1.00694331]
     [-0.93374229 -0.79402593  1.00909515]
     [-0.90636868  2.75448909  1.772864  ]
     [ 0.62005229 -1.3732454  -0.39237323]
     [ 0.74139934 -1.05271986 -0.9964703 ]
     [-1.81968206  1.53213677 -0.94698653]
     [-0.43419928  0.90834502  2.05707125]
     [-0.19206677  0.3104947   0.11505178]
     [-0.19129044 -0.39785095 -0.13277081]
     [-1.64958117  1.57707358  0.67063495]
     [-1.27544266 -1.26647034  1.3965837 ]
     [ 1.63351975 -0.85734405 -1.52143762]]

    Y Original: 
    [1 0 0 1 0 2 2 1 2 1 0 2 2 0 0 1 1 0 1 2]

    X_Train: 
    [[-0.51743065  0.63402055  2.12912755]
     [ 1.63351975 -0.85734405 -1.52143762]
     [-0.93374229 -0.79402593  1.00909515]
     [ 0.74139934 -1.05271986 -0.9964703 ]
     [ 1.06615086 -2.74141467 -0.26813435]
     [-2.09612492  0.89610929 -0.34134785]
     [-1.27544266 -1.26647034  1.3965837 ]
     [-0.15885567 -0.15289292 -1.00694331]
     [-0.19206677  0.3104947   0.11505178]
     [-0.43419928  0.90834502  2.05707125]
     [-1.64958117  1.57707358  0.67063495]
     [-1.18819319 -0.42454412  1.49949316]
     [-0.95747896 -1.70466278  0.69822828]
     [-1.81968206  1.53213677 -0.94698653]]

    Y_Train: 
    [0 2 1 2 2 0 1 2 1 0 0 1 1 0]

    X_Test: 
    [[-0.88205757  0.84812284 -0.65742989]
     [-0.90636868  2.75448909  1.772864  ]
     [-0.69590064 -0.67561329  0.62524618]
     [ 0.62005229 -1.3732454  -0.39237323]
     [-0.19129044 -0.39785095 -0.13277081]
     [-1.09492175  1.27630932  2.15598887]]

    Y_Test: 
    [2 0 1 2 1 0]
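To fill in the two saving placeholders at the end of the example, a minimal sketch is shown below; the file names (`train_images.npy`, `test_labels.csv`, etc.) are illustrative choices, not part of your original setup, and the small arrays stand in for the split produced above:

```python
import numpy as np
import pandas as pd

# Stand-ins for the X_train/X_test/Y_train/Y_test produced by the split above.
X_train = np.random.rand(8, 3)
X_test = np.random.rand(2, 3)
Y_train = np.array([0, 1, 2, 0, 1, 2, 0, 1])
Y_test = np.array([2, 0])

# Save the data arrays as separate NPY files.
np.save('train_images.npy', X_train)
np.save('test_images.npy', X_test)

# Save the labels as separate one-column CSV files.
pd.DataFrame({'label': Y_train}).to_csv('train_labels.csv', index=False, encoding='utf-8')
pd.DataFrame({'label': Y_test}).to_csv('test_labels.csv', index=False, encoding='utf-8')

# Round-trip check: reload and compare.
assert np.array_equal(np.load('train_images.npy'), X_train)
assert pd.read_csv('train_labels.csv')['label'].tolist() == Y_train.tolist()
```

Later, `np.load` and `pd.read_csv` recover exactly what was saved, so the training script never has to redo the split.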

Update

Per your comment below: if you want to rebuild the same kind of CSV file from a DataFrame that has additional columns besides label (which is uncommon), you can still slice the original pandas DataFrame crowd_annotations as follows.

First, load the CSV file into crowd_annotations_ (note the trailing underscore):

crowd_annotations_ = pd.read_csv(path + 'crowd_annotationsGold.csv', encoding = 'utf-8')

Then take the label column separately:

crowd_annotations = crowd_annotations_['label'].to_numpy()

Continue as in the example above and split all_images and crowd_annotations, which correspond to X and Y respectively.

Finally, apply train_index and test_index both to crowd_annotations (as in the example above) and to crowd_annotations_, like this:

crowd_annotations_train = crowd_annotations_.iloc[train_index]
crowd_annotations_test = crowd_annotations_.iloc[test_index]
# save crowd_annotations_train and crowd_annotations_test as `CSV`
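Putting all the steps together for your actual data, here is an end-to-end sketch. The synthetic all_images array and crowd_annotations_ DataFrame below are stand-ins for your real `np.load(...)` / `pd.read_csv(...)` calls (the image shape, the image_name column, and the output file names are all assumptions for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

SEED = 2020  # same seed idea as above, for reproducible shuffling

# Stand-ins for your real data: in your notebook these come from
# np.load(path + 'all_images.npy') and crowd_annotationsGold.csv.
all_images = np.random.rand(100, 32, 32, 3)     # hypothetical image shape
crowd_annotations_ = pd.DataFrame({
    'image_name': ['img_%03d.png' % i for i in range(100)],
    'label': np.repeat(np.arange(10), 10),      # 10 classes, sorted order
})
crowd_annotations = crowd_annotations_['label'].to_numpy()

# One stratified 80/20 split (test_size=0.2, as requested in the question).
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=SEED)
train_index, test_index = next(sss.split(all_images, crowd_annotations))

# Slice and save the image arrays.
X_train, X_test = all_images[train_index], all_images[test_index]
np.save('train_images.npy', X_train)
np.save('test_images.npy', X_test)

# Slice the full DataFrame so any extra columns survive alongside `label`.
crowd_annotations_.iloc[train_index].to_csv('train_labels.csv', index=False, encoding='utf-8')
crowd_annotations_.iloc[test_index].to_csv('test_labels.csv', index=False, encoding='utf-8')

print(X_train.shape, X_test.shape)  # (80, 32, 32, 3) (20, 32, 32, 3)
```

Because the split is stratified, each of the 10 classes contributes proportionally to both CSV files, even though the original labels were in sorted order.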
