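I'm working on automatic segmentation, and over the weekend the power went out while I was training a model. The model had been training for more than 50 hours, and I was saving it every 5 epochs with the following line: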
model_checkpoint = ModelCheckpoint('test_{epoch:04}.h5', monitor=observe_var, mode='auto', save_weights_only=False, save_best_only=False, period = 5)
I load the saved model with the following line:
model = load_model('test_{epoch:04}.h5', custom_objects = {'dice_coef_loss': dice_coef_loss, 'dice_coef': dice_coef})
I have all my data prepared, with the training data split into train_x for the scans and train_y for the labels. When I run the following line:
loss, dice_coef = model.evaluate(train_x, train_y, verbose=1)
I get the following error:
ResourceExhaustedError: OOM when allocating tensor with shape[32,8,128,128,128] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[node model/conv3d_1/Conv3D (defined at <ipython-input-1-4a66b6c9f26b>:275) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[Op:__inference_distributed_function_3673]
Function call stack:
distributed_function
Answer:
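This basically means you ran out of memory on the GPU (OOM stands for "out of memory"), so you need to evaluate in smaller batches. The default batch size is 32, which is the leading 32 in shape[32,8,128,128,128] in the error message. Try passing a smaller batch size: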
model.evaluate(train_x, train_y, batch_size=<batch size>)
From the Keras documentation:
batch_size: Integer or None. Number of samples per gradient update. If unspecified, batch_size will default to 32.
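To get a feel for why the default batch is too large here, the size of the tensor named in the error can be estimated by hand. The following is only a rough sketch: it assumes "type float" means 32-bit floats (4 bytes each), it counts that single Conv3D output and nothing else the GPU has to hold, and the helper names tensor_bytes and conv_output_gib are made up for illustration.

```python
from math import prod

BYTES_PER_FLOAT32 = 4  # assumption: "type float" in the error means 32-bit floats


def tensor_bytes(shape, bytes_per_element=BYTES_PER_FLOAT32):
    """Return the memory needed to hold one dense tensor of the given shape."""
    return prod(shape) * bytes_per_element


def conv_output_gib(batch_size):
    """Size in GiB of the tensor from the error message, rescaled to batch_size."""
    # The error reports shape [32, 8, 128, 128, 128]; the leading 32 is the batch.
    return tensor_bytes((batch_size, 8, 128, 128, 128)) / 2**30


# Compare the default batch size with a few smaller candidates.
for batch_size in (32, 8, 2, 1):
    print(f"batch_size={batch_size:>2}: {conv_output_gib(batch_size):.4f} GiB")
```

At batch size 32 this one tensor already takes 2 GiB, and it shrinks linearly with the batch size (0.0625 GiB at batch size 1). With 3D volumes like these, starting at batch_size=1 or 2 and increasing until the error comes back is a reasonable way to find what your GPU can handle.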