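I'm trying to split my dataset in an 8:1:1 ratio (train/validation/test). The dataset lives in a single directory. At first I tried the following code: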
train_ds = tf.keras.preprocessing.image_dataset_from_directory(
    dir,
    validation_split=0.2,
    subset="training",
    seed=123,
    image_size=(img_height, img_width),
    batch_size=batch_size)

val_ds = tf.keras.preprocessing.image_dataset_from_directory(
    dir,
    validation_split=0.1,
    subset="validation",
    seed=123,
    image_size=(img_height, img_width),
    batch_size=batch_size)
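But this did not do the job: it only split my directory into two datasets (train_ds and val_ds), with no separate test set. After that I used the following code: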
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Create the data generator
datagen = ImageDataGenerator()

# Load and iterate over the training dataset
train_it = datagen.flow_from_directory(
    dir, target_size=(32, 32), color_mode='grayscale', class_mode='binary',
    batch_size=32, shuffle=True, follow_links=False, subset=None,
    interpolation='nearest')

# Load and iterate over the validation dataset
val_it = datagen.flow_from_directory(
    dir, target_size=(32, 32), color_mode='grayscale', class_mode='binary',
    batch_size=32, shuffle=True, follow_links=False, subset=None,
    interpolation='nearest')

# Load and iterate over the test dataset
test_it = datagen.flow_from_directory(
    dir, target_size=(32, 32), color_mode='grayscale', class_mode='binary',
    batch_size=32, shuffle=True, follow_links=False, subset=None,
    interpolation='nearest')
This code also causes problems with my model: when I use it, my model summary looks like this:
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
rescaling_1 (Rescaling)      (None, None, None, None)  0
_________________________________________________________________
conv2d_3 (Conv2D)            (None, None, None, 32)    320
_________________________________________________________________
max_pooling2d_3 (MaxPooling2 (None, None, None, 32)    0
_________________________________________________________________
conv2d_4 (Conv2D)            (None, None, None, 32)    9248
_________________________________________________________________
max_pooling2d_4 (MaxPooling2 (None, None, None, 32)    0
_________________________________________________________________
conv2d_5 (Conv2D)            (None, None, None, 32)    9248
_________________________________________________________________
max_pooling2d_5 (MaxPooling2 (None, None, None, 32)    0
_________________________________________________________________
dropout_1 (Dropout)          (None, None, None, 32)    0
_________________________________________________________________
flatten_1 (Flatten)          (None, None)              0
_________________________________________________________________
dense_2 (Dense)              (None, 128)               16512
_________________________________________________________________
dense_3 (Dense)              (None, 26)                3354
=================================================================
Total params: 38,682
Trainable params: 38,682
Non-trainable params: 0
_________________________________________________________________
This is my model:
num_classes = 26

model = tf.keras.Sequential([
    tf.keras.layers.experimental.preprocessing.Rescaling(1./255),
    tf.keras.layers.Conv2D(32, 3, activation='relu'),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(32, 3, activation='relu'),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(32, 3, activation='relu'),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(num_classes)
])

model.compile(
    optimizer='adam',
    loss=tf.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy'])
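A note on reading that summary: the None entries in the output shapes suggest the model was never given a fixed input shape, yet the parameter counts are still informative. The sketch below is plain-Python arithmetic, not Keras code. It assumes the Keras defaults for these layers ('valid' padding and stride 1 for Conv2D, 2x2 pooling for MaxPooling2D) and a 32x32 single-channel input, matching target_size=(32, 32) and color_mode='grayscale' in the generator calls. The helper names are made up for illustration.

```python
def conv_out(size, kernel=3):
    """Spatial size after a 'valid' convolution with stride 1."""
    return size - kernel + 1


def pool_out(size, pool=2):
    """Spatial size after non-overlapping max pooling."""
    return size // pool


def conv_params(in_channels, filters, kernel=3):
    """Weights plus one bias per filter."""
    return kernel * kernel * in_channels * filters + filters


def dense_params(in_units, out_units):
    """Weights plus one bias per output unit."""
    return in_units * out_units + out_units


def trace(height, width, channels, filters=32, blocks=3, hidden=128, classes=26):
    """Return (layer, output shape, parameter count) for each layer with weights or a shape change."""
    rows = []
    for _ in range(blocks):
        height, width = conv_out(height), conv_out(width)
        rows.append(("conv2d", (height, width, filters), conv_params(channels, filters)))
        channels = filters
        height, width = pool_out(height), pool_out(width)
        rows.append(("max_pooling2d", (height, width, channels), 0))
    flat = height * width * channels
    rows.append(("flatten", (flat,), 0))
    rows.append(("dense", (hidden,), dense_params(flat, hidden)))
    rows.append(("dense", (classes,), dense_params(hidden, classes)))
    return rows


rows = trace(32, 32, 1)
total = sum(params for _, _, params in rows)
for name, shape, params in rows:
    print(f"{name:<14} {str(shape):<14} {params}")
print("total params:", total)
```

Under those assumptions the per-layer counts and the total agree with the summary above, so the weights were most likely built from 32x32 grayscale batches. Declaring the input shape on the first layer would make the summary show concrete shapes instead of None.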
So, how can I split my data without running into these problems?
Answer:
You can do it as follows:
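import glob  # used to collect the full paths of all images
import os

import tensorflow as tf
from sklearn.model_selection import train_test_split

# Suppose your images live in two folders, 'a' and 'b', inside the 'dir' folder.
# You can get the path of every image like this:
image_paths_a = glob.glob('./dir/a/*.jpg')  # if the files end in .jpg
image_paths_b = glob.glob('./dir/b/*.jpg')  # images in folder b
images_total = image_paths_a + image_paths_b

# If you have more folders you can do the same for them,
# or collect the images from every folder inside 'dir' at once:
images_total = glob.glob('./dir/*/*.jpg')

# Now get the labels for these images.
# If you label by folder name, take the parent folder of each path.
folder_names = [os.path.basename(os.path.dirname(p)) for p in images_total]
# SparseCategoricalCrossentropy expects integer class indices,
# so map each folder name to its position in the sorted class list.
class_names = sorted(set(folder_names))
image_labels = [class_names.index(name) for name in folder_names]

# You now have two lists: 1) image paths and 2) their labels.
# Use sklearn.model_selection.train_test_split to get your splits.

# Keep 80% for training and hold out the remaining 20% for a further split
xtrain, xtest, ytrain, ytest = train_test_split(
    images_total, image_labels, stratify=image_labels,
    random_state=1234, test_size=0.2)

# Split the held-out 20% in half: 10% validation and 10% test of the original data
xvalid, xtest, yvalid, ytest = train_test_split(
    xtest, ytest, stratify=ytest, random_state=1234, test_size=0.5)

# Before creating the datasets, you need a function that reads an image from its path.
def read_img(path, label):
    file = tf.io.read_file(path)
    # The glob above matches .jpg files, so decode as JPEG
    # (pass channels=1 to force grayscale, as in the question's generator).
    img = tf.io.decode_jpeg(file)
    # dim1 and dim2 are your desired dimensions
    img = tf.image.resize(img, (dim1, dim2))
    return img, label

train_dataset = tf.data.Dataset.from_tensor_slices((xtrain, ytrain))
train_dataset = train_dataset.map(read_img).batch(batch_size)

valid_dataset = tf.data.Dataset.from_tensor_slices((xvalid, yvalid))
valid_dataset = valid_dataset.map(read_img).batch(batch_size)

test_dataset = tf.data.Dataset.from_tensor_slices((xtest, ytest))
test_dataset = test_dataset.map(read_img).batch(batch_size)

# Now you just train your model
model.fit(train_dataset, epochs=5, validation_data=valid_dataset)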
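To make the two-stage split easier to follow, here is a rough standard-library sketch of the same idea: group paths by label, shuffle within each group, and cut each group 80/10/10. It is a stand-in for illustration only, not how scikit-learn implements stratify, and its rounding can differ from train_test_split on small classes. The stratified_split helper and the file names are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(paths, labels, fractions=(0.8, 0.1), seed=1234):
    """Split (path, label) pairs into train/valid/test, class by class.

    fractions gives the train and validation shares; the test split
    receives whatever remains in each class.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for path, label in zip(paths, labels):
        by_label[label].append(path)

    train, valid, test = [], [], []
    for label in sorted(by_label):
        items = by_label[label]
        rng.shuffle(items)
        n_train = int(len(items) * fractions[0])
        n_valid = int(len(items) * fractions[1])
        train += [(p, label) for p in items[:n_train]]
        valid += [(p, label) for p in items[n_train:n_train + n_valid]]
        test += [(p, label) for p in items[n_train + n_valid:]]
    return train, valid, test


# Hypothetical layout: 50 images in each of two class folders
paths = [f"dir/{cls}/{i}.jpg" for cls in ("a", "b") for i in range(50)]
folder_names = [p.split("/")[-2] for p in paths]

# Map folder names to integer class indices, as a sparse loss expects
label_to_index = {name: i for i, name in enumerate(sorted(set(folder_names)))}
labels = [label_to_index[name] for name in folder_names]

train, valid, test = stratified_split(paths, labels)
print(len(train), len(valid), len(test))
```

Splitting per class keeps each class at roughly the same proportion in all three subsets, which is what stratify=image_labels asks of train_test_split in the answer above.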