I am learning to do multi-image classification with a deep learning model, using Keras and TensorFlow for the classification task.
The image dataset (10,000 images) is an array, and the labels live in a CSV file that contains the image names and the gold labels for the 10 classes. I use the following code to retrieve the images and labels:
import pandas as pd
import numpy as np

path = '/content/image_data/'
all_images = np.load(path + 'all_images.npy')
crowd_annotations = pd.read_csv(path + 'crowd_annotationsGold.csv', encoding='utf-8')
crowd_annotations = crowd_annotations['label'].to_numpy()

print(all_images)
print(crowd_annotations)
When I print, I get the following array data (output truncated by NumPy):
[[[[0.23137255 0.24313725 0.24705882]
   [0.16862745 0.18039216 0.17647059]
   [0.19607843 0.18823529 0.16862745]
   ...
   [0.61960784 0.51764706 0.42352941]
   [0.59607843 0.49019608 0.4       ]
   [0.58039216 0.48627451 0.40392157]]
  ...]
 ...
 [[[0.60392157 0.69411765 0.73333333]
   [0.49411765 0.5372549  0.53333333]
   [0.41176471 0.40784314 0.37254902]
   ...]
  ...
  [[0.47843137 0.46666667 0.44705882]
   [0.4627451  0.45490196 0.43137255]
   [0.47058824 0.45490196 0.43529412]
   ...
   [0.70196078 0.69411765 0.67843137]
   [0.64313725 0.64313725 0.63529412]
   [0.63921569 0.63921569 0.63137255]]]]
[0 0 0 ... 9 9 9]
I want to split the data into a training set and a test set.
After the split, I would like one array of training data and one array of test data, saved separately. Likewise, the training labels and the test labels (as lists or CSV files) need to be saved separately. Since the data is in sorted order, it probably needs to be shuffled before being split into training and test sets.
Afterwards, I want to use them to train a model and then evaluate it. The test set should be 20% of the data.
Answer:
This can be achieved with StratifiedShuffleSplit, which returns train and test indices in a stratified fashion (this also ensures that the classes are well distributed/represented in both sets).
Once you have the indices, you simply slice your data and then save the pieces however you like.
Here is a simple example to illustrate:
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedShuffleSplit

SEED = 2020  # for reproducibility due to the shuffling

# create some random classification data - make it small for printing out
X, Y = make_classification(n_samples=20, n_features=3, n_informative=3,
                           n_redundant=0, n_repeated=0, n_clusters_per_class=1,
                           n_classes=3, random_state=SEED)
print("X Original: \n{}\n".format(X))
print("Y Original: \n{}\n".format(Y))

# perform stratified shuffle split. Note the SEED usage for shuffling.
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.3, random_state=SEED)
train_index, test_index = next(sss.split(X, Y))
X_train, X_test = X[train_index], X[test_index]
Y_train, Y_test = Y[train_index], Y[test_index]
print("X_Train: \n{}\n".format(X_train))
print("Y_Train: \n{}\n".format(Y_train))
print("X_Test: \n{}\n".format(X_test))
print("Y_Test: \n{}\n".format(Y_test))

# your code for saving X_train and X_test in separate NPY files, goes here
# your code for saving Y_train and Y_test in separate CSV files, goes here
Output
X Original: 
[[-0.69590064 -0.67561329  0.62524618]
 [-1.09492175  1.27630932  2.15598887]
 [-0.51743065  0.63402055  2.12912755]
 [-1.18819319 -0.42454412  1.49949316]
 [-2.09612492  0.89610929 -0.34134785]
 [ 1.06615086 -2.74141467 -0.26813435]
 [-0.88205757  0.84812284 -0.65742989]
 [-0.95747896 -1.70466278  0.69822828]
 [-0.15885567 -0.15289292 -1.00694331]
 [-0.93374229 -0.79402593  1.00909515]
 [-0.90636868  2.75448909  1.772864  ]
 [ 0.62005229 -1.3732454  -0.39237323]
 [ 0.74139934 -1.05271986 -0.9964703 ]
 [-1.81968206  1.53213677 -0.94698653]
 [-0.43419928  0.90834502  2.05707125]
 [-0.19206677  0.3104947   0.11505178]
 [-0.19129044 -0.39785095 -0.13277081]
 [-1.64958117  1.57707358  0.67063495]
 [-1.27544266 -1.26647034  1.3965837 ]
 [ 1.63351975 -0.85734405 -1.52143762]]

Y Original: 
[1 0 0 1 0 2 2 1 2 1 0 2 2 0 0 1 1 0 1 2]

X_Train: 
[[-0.51743065  0.63402055  2.12912755]
 [ 1.63351975 -0.85734405 -1.52143762]
 [-0.93374229 -0.79402593  1.00909515]
 [ 0.74139934 -1.05271986 -0.9964703 ]
 [ 1.06615086 -2.74141467 -0.26813435]
 [-2.09612492  0.89610929 -0.34134785]
 [-1.27544266 -1.26647034  1.3965837 ]
 [-0.15885567 -0.15289292 -1.00694331]
 [-0.19206677  0.3104947   0.11505178]
 [-0.43419928  0.90834502  2.05707125]
 [-1.64958117  1.57707358  0.67063495]
 [-1.18819319 -0.42454412  1.49949316]
 [-0.95747896 -1.70466278  0.69822828]
 [-1.81968206  1.53213677 -0.94698653]]

Y_Train: 
[0 2 1 2 2 0 1 2 1 0 0 1 1 0]

X_Test: 
[[-0.88205757  0.84812284 -0.65742989]
 [-0.90636868  2.75448909  1.772864  ]
 [-0.69590064 -0.67561329  0.62524618]
 [ 0.62005229 -1.3732454  -0.39237323]
 [-0.19129044 -0.39785095 -0.13277081]
 [-1.09492175  1.27630932  2.15598887]]

Y_Test: 
[2 0 1 2 1 0]
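To tie this back to your own data: below is a minimal sketch (the file names are illustrative assumptions, not from your question) that splits all_images and crowd_annotations with the 20% test size you asked for, and saves the pieces with np.save and pandas' to_csv:

import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# stratified 80/20 split; random_state fixes the shuffling for reproducibility
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=2020)
train_index, test_index = next(sss.split(all_images, crowd_annotations))

# slice the image array and the label array with the stratified indices
train_images, test_images = all_images[train_index], all_images[test_index]
train_labels, test_labels = crowd_annotations[train_index], crowd_annotations[test_index]

# save the image arrays as separate NPY files (file names are assumptions)
np.save(path + 'train_images.npy', train_images)
np.save(path + 'test_images.npy', test_images)

# save the labels as separate CSV files
pd.DataFrame({'label': train_labels}).to_csv(path + 'train_labels.csv', index=False)
pd.DataFrame({'label': test_labels}).to_csv(path + 'test_labels.csv', index=False)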
Update
Based on your comment below: if you want to reconstruct the same kind of CSV file from the DataFrame with its existing columns (i.e., columns other than label, which is uncommon), you can still slice the original pandas DataFrame crowd_annotations as follows.
First, load the CSV file into crowd_annotations_ (note the underscore):
crowd_annotations_ = pd.read_csv(path + 'crowd_annotationsGold.csv', encoding='utf-8')
Then take the label column on its own:
crowd_annotations = crowd_annotations_['label'].to_numpy()
Continue as in the example above, splitting all_images and crowd_annotations, which correspond to X and Y respectively.
Finally, use train_index and test_index on both crowd_annotations (as in the example above) and crowd_annotations_, like this:
crowd_annotations_train = crowd_annotations_.iloc[train_index]
crowd_annotations_test = crowd_annotations_.iloc[test_index]
# save crowd_annotations_train and crowd_annotations_test as CSV
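Saving those sliced DataFrames works the same way, e.g. with pandas' to_csv (file names are illustrative):

crowd_annotations_train.to_csv(path + 'crowd_annotations_train.csv', index=False, encoding='utf-8')
crowd_annotations_test.to_csv(path + 'crowd_annotations_test.csv', index=False, encoding='utf-8')

As for the training and evaluation step you mentioned, that is independent of the splitting. Here is a minimal Keras sketch using the names from the sketch above; the architecture and hyperparameters are placeholder assumptions, not something from your question:

import tensorflow as tf

# a deliberately small model just to show the fit/evaluate flow (architecture is an assumption)
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=all_images.shape[1:]),
    tf.keras.layers.Conv2D(32, 3, activation='relu'),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation='softmax'),  # 10 classes
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',  # labels are integers 0..9
              metrics=['accuracy'])
model.fit(train_images, train_labels, epochs=10, validation_split=0.1)
model.evaluate(test_images, test_labels)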