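I am (still) trying to implement a simple U-Net in Keras with the TensorFlow 2.0 backend.
My images and masks are 1536×1536 RGB images (the masks are black and white). According to this article, the amount of memory required can be estimated.
My model crashes with a memory allocation error when it handles a tensor of shape [1,16,1536,1536]. Using the formula from the article above, I calculated the memory this tensor needs: 1 * 16 * 1536 * 1536 * 4 bytes = 144 MB. I have a GTX 1080 Ti with about 9 GB of memory available to TensorFlow. What is going wrong? Am I missing something?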
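As a quick check of the arithmetic above, the sketch below computes the size of a single tensor from its shape. The helper name `tensor_bytes` is made up for illustration, and 4 bytes per element is assumed because the error message reports the type as float (float32).

```python
from math import prod


def tensor_bytes(shape, bytes_per_element=4):
    """Return the size in bytes of a dense tensor with the given shape.

    bytes_per_element=4 assumes float32 values.
    """
    return prod(shape) * bytes_per_element


size = tensor_bytes([1, 16, 1536, 1536])
print(size)          # 150994944
print(size / 2**20)  # 144.0 (MiB)
```

The result agrees with the "144.00MiB (rounded to 150994944)" reported by the allocator in the log. It covers this one tensor only, not everything the model holds in memory at the same time.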
Here is the almost complete traceback:
2020-03-02 15:59:13.841967: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_100.dll
2020-03-02 15:59:16.083234: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
2020-03-02 15:59:16.087240: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library nvcuda.dll
2020-03-02 15:59:16.210856: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.607
pciBusID: 0000:41:00.0
2020-03-02 15:59:16.210988: I tensorflow/stream_executor/platform/default/dlopen_checker_stub.cc:25] GPU libraries are statically linked, skip dlopen check.
2020-03-02 15:59:16.211429: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2020-03-02 15:59:16.947775: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-03-02 15:59:16.947868: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165] 0
2020-03-02 15:59:16.947922: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0: N
2020-03-02 15:59:16.948594: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 8784 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:41:00.0, compute capability: 6.1)
2020-03-02 15:59:16.994676: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.607
pciBusID: 0000:41:00.0
2020-03-02 15:59:16.994849: I tensorflow/stream_executor/platform/default/dlopen_checker_stub.cc:25] GPU libraries are statically linked, skip dlopen check.
2020-03-02 15:59:16.995291: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2020-03-02 15:59:16.995793: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.607
pciBusID: 0000:41:00.0
2020-03-02 15:59:16.995908: I tensorflow/stream_executor/platform/default/dlopen_checker_stub.cc:25] GPU libraries are statically linked, skip dlopen check.
2020-03-02 15:59:16.996301: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2020-03-02 15:59:16.996406: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-03-02 15:59:16.996491: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165] 0
2020-03-02 15:59:16.996541: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0: N
2020-03-02 15:59:16.996942: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 8784 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:41:00.0, compute capability: 6.1)
2020-03-02 15:59:18.191834: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.607
pciBusID: 0000:41:00.0
2020-03-02 15:59:18.191964: I tensorflow/stream_executor/platform/default/dlopen_checker_stub.cc:25] GPU libraries are statically linked, skip dlopen check.
2020-03-02 15:59:18.192383: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2020-03-02 15:59:18.192499: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-03-02 15:59:18.192591: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165] 0
2020-03-02 15:59:18.192644: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0: N
2020-03-02 15:59:18.193053: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/device:GPU:0 with 8784 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:41:00.0, compute capability: 6.1)
Epoch 1/100
2020-03-02 15:59:18.421211: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2020-03-02 15:59:19.577897: I tensorflow/stream_executor/cuda/cuda_driver.cc:830] failed to allocate 512.00M (536870912 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2020-03-02 15:59:19.616600: I tensorflow/stream_executor/cuda/cuda_driver.cc:830] failed to allocate 460.80M (483183872 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2020-03-02 15:59:19.638395: W tensorflow/stream_executor/cuda/redzone_allocator.cc:312] Internal: Invoking ptxas not supported on Windows
Relying on driver to perform ptx compilation. This message will be only logged once.
2020-03-02 15:59:19.644478: I tensorflow/stream_executor/cuda/cuda_driver.cc:830] failed to allocate 1.00G (1073741824 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2020-03-02 15:59:19.644601: W tensorflow/core/common_runtime/bfc_allocator.cc:305] Garbage collection: deallocate free memory regions (i.e., allocations) so that we can re-allocate a larger region to avoid OOM due to memory fragmentation. If you see this message frequently, you are running near the threshold of the available device memory and re-allocation may incur great performance overhead. You may try smaller batch sizes to observe the performance impact. Set TF_ENABLE_GPU_GARBAGE_COLLECTION=false if you'd like to disable this feature.
2020-03-02 15:59:19.653644: I tensorflow/stream_executor/cuda/cuda_driver.cc:830] failed to allocate 1.00G (1073741824 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2020-03-02 15:59:19.653767: W tensorflow/core/common_runtime/bfc_allocator.cc:239] Allocator (GPU_0_bfc) ran out of memory trying to allocate 259.00MiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-03-02 15:59:19.865828: I tensorflow/stream_executor/cuda/cuda_driver.cc:830] failed to allocate 1.00G (1073741824 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2020-03-02 15:59:19.874844: I tensorflow/stream_executor/cuda/cuda_driver.cc:830] failed to allocate 1.00G (1073741824 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2020-03-02 15:59:29.884662: I tensorflow/stream_executor/cuda/cuda_driver.cc:830] failed to allocate 1.00G (1073741824 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2020-03-02 15:59:29.893593: I tensorflow/stream_executor/cuda/cuda_driver.cc:830] failed to allocate 1.00G (1073741824 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2020-03-02 15:59:29.893792: W tensorflow/core/common_runtime/bfc_allocator.cc:419] Allocator (GPU_0_bfc) ran out of memory trying to allocate 144.00MiB (rounded to 150994944). Current allocation summary follows.
2020-03-02 15:59:29.919126: I tensorflow/core/common_runtime/bfc_allocator.cc:923] total_region_allocated_bytes_: 1054574080 memory_limit_: 9210949796 available bytes: 8156375716 curr_region_allocation_bytes_: 1073741824
2020-03-02 15:59:29.919304: I tensorflow/core/common_runtime/bfc_allocator.cc:929] Stats:
Limit: 9210949796
InUse: 1010432000
MaxInUse: 1010432000
NumAllocs: 594
MaxAllocSize: 283870720
2020-03-02 15:59:29.919520: W tensorflow/core/common_runtime/bfc_allocator.cc:424] *****__****************xxxxxxxxxx***************xxxxxxxxxx******************************xxxxxxxxxxxx
2020-03-02 15:59:29.919696: W tensorflow/core/framework/op_kernel.cc:1622] OP_REQUIRES failed at conv_ops.cc:947 : Resource exhausted: OOM when allocating tensor with shape[1,16,1536,1536] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Traceback (most recent call last):
  File "E:/Explorium/python/unet_trainer.py", line 82, in <module>
    results = model.fit_generator(train_generator, epochs=EPOCHS, steps_per_epoch=STEPS_PER_EPOCH, validation_data=val_generator, validation_steps=VALIDATION_STEPS, callbacks=callbacks)
  File "C:\Users\E-soft\Anaconda3\envs\Explorium\lib\site-packages\tensorflow_core\python\keras\engine\training.py", line 1297, in fit_generator
    steps_name='steps_per_epoch')
  File "C:\Users\E-soft\Anaconda3\envs\Explorium\lib\site-packages\tensorflow_core\python\keras\engine\training_generator.py", line 265, in model_iteration
    batch_outs = batch_function(*batch_data)
  File "C:\Users\E-soft\Anaconda3\envs\Explorium\lib\site-packages\tensorflow_core\python\keras\engine\training.py", line 973, in train_on_batch
    class_weight=class_weight, reset_metrics=reset_metrics)
  File "C:\Users\E-soft\Anaconda3\envs\Explorium\lib\site-packages\tensorflow_core\python\keras\engine\training_v2_utils.py", line 264, in train_on_batch
    output_loss_metrics=model._output_loss_metrics)
  File "C:\Users\E-soft\Anaconda3\envs\Explorium\lib\site-packages\tensorflow_core\python\keras\engine\training_eager.py", line 311, in train_on_batch
    output_loss_metrics=output_loss_metrics))
  File "C:\Users\E-soft\Anaconda3\envs\Explorium\lib\site-packages\tensorflow_core\python\keras\engine\training_eager.py", line 252, in _process_single_batch
    training=training))
  File "C:\Users\E-soft\Anaconda3\envs\Explorium\lib\site-packages\tensorflow_core\python\keras\engine\training_eager.py", line 127, in _model_loss
    outs = model(inputs, **kwargs)
  File "C:\Users\E-soft\Anaconda3\envs\Explorium\lib\site-packages\tensorflow_core\python\keras\engine\base_layer.py", line 891, in __call__
    outputs = self.call(cast_inputs, *args, **kwargs)
  File "C:\Users\E-soft\Anaconda3\envs\Explorium\lib\site-packages\tensorflow_core\python\keras\engine\network.py", line 708, in call
    convert_kwargs_to_constants=base_layer_utils.call_context().saving)
  File "C:\Users\E-soft\Anaconda3\envs\Explorium\lib\site-packages\tensorflow_core\python\keras\engine\network.py", line 860, in _run_internal_graph
    output_tensors = layer(computed_tensors, **kwargs)
  File "C:\Users\E-soft\Anaconda3\envs\Explorium\lib\site-packages\tensorflow_core\python\keras\engine\base_layer.py", line 891, in __call__
    outputs = self.call(cast_inputs, *args, **kwargs)
  File "C:\Users\E-soft\Anaconda3\envs\Explorium\lib\site-packages\tensorflow_core\python\keras\layers\convolutional.py", line 197, in call
    outputs = self._convolution_op(inputs, self.kernel)
  File "C:\Users\E-soft\Anaconda3\envs\Explorium\lib\site-packages\tensorflow_core\python\ops\nn_ops.py", line 1134, in __call__
    return self.conv_op(inp, filter)
  File "C:\Users\E-soft\Anaconda3\envs\Explorium\lib\site-packages\tensorflow_core\python\ops\nn_ops.py", line 639, in __call__
    return self.call(inp, filter)
  File "C:\Users\E-soft\Anaconda3\envs\Explorium\lib\site-packages\tensorflow_core\python\ops\nn_ops.py", line 238, in __call__
    name=self.name)
  File "C:\Users\E-soft\Anaconda3\envs\Explorium\lib\site-packages\tensorflow_core\python\ops\nn_ops.py", line 2010, in conv2d
    name=name)
  File "C:\Users\E-soft\Anaconda3\envs\Explorium\lib\site-packages\tensorflow_core\python\ops\gen_nn_ops.py", line 1031, in conv2d
    data_format=data_format, dilations=dilations, name=name, ctx=_ctx)
  File "C:\Users\E-soft\Anaconda3\envs\Explorium\lib\site-packages\tensorflow_core\python\ops\gen_nn_ops.py", line 1130, in conv2d_eager_fallback
    ctx=_ctx, name=name)
  File "C:\Users\E-soft\Anaconda3\envs\Explorium\lib\site-packages\tensorflow_core\python\eager\execute.py", line 67, in quick_execute
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[1,16,1536,1536] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [Op:Conv2D]

Process finished with exit code 1
Here is my model:
...
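Answer:
The problem you are running into is the size of your images.
It is not the size of the model, as others suggested in the comments, but the input size of your images: images this large need more GPU memory to process.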
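In your case, the solution is to halve the image size. Scale the width and the height by the same factor so that the aspect ratio is preserved; the network can then learn from the smaller images without losing too much information and without introducing distortion.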
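A minimal sketch of such a downscaling step, assuming the images are channel-last NumPy arrays with even height and width. The helpers `halve_image` and `halve_mask` are hypothetical names, and the two resampling choices (2×2 averaging for the image, keeping every second pixel for the mask so it stays strictly black and white) are illustrative assumptions, not something prescribed by this answer. In a real pipeline a library resize function would normally do this job.

```python
import numpy as np


def halve_image(image):
    """Downscale an (H, W, C) image by 2 in both dimensions via 2x2 averaging."""
    h, w, c = image.shape
    if h % 2 or w % 2:
        raise ValueError("height and width must both be even")
    # Group pixels into 2x2 blocks and average each block.
    blocks = image.reshape(h // 2, 2, w // 2, 2, c)
    return blocks.mean(axis=(1, 3))


def halve_mask(mask):
    """Downscale a mask by 2 by keeping every second pixel (no new values appear)."""
    return mask[::2, ::2]


image = np.zeros((1536, 1536, 3), dtype=np.float32)
mask = np.zeros((1536, 1536), dtype=np.uint8)
print(halve_image(image).shape)  # (768, 768, 3)
print(halve_mask(mask).shape)    # (768, 768)
```

Because both dimensions shrink by the same factor, the aspect ratio is unchanged, and the image and its mask stay aligned pixel for pixel.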
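At 768×768 with a batch size of 1 you will be able to train on your GTX 1080 Ti (I have a GTX 1080 Ti myself and have tested several segmentation networks at several input sizes). If other processes (YouTube or the like) are taking up part of your GPU memory and 768×768 still fails, scaling down to 512×512 will certainly work, although 768×768 with a batch size of 1 should work as well.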
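To give a rough sense of how much the smaller sizes save, the sketch below recomputes the size of the 16-channel float32 tensor from the error message at each input size mentioned here. It assumes the layer still produces 16 channels at full input resolution; the helper `activation_mib` is hypothetical, and the figures cover a single tensor, not the total memory the network needs during training.

```python
def activation_mib(side, channels=16, batch=1, bytes_per_element=4):
    """Size in MiB of one float32 tensor of shape [batch, channels, side, side]."""
    return batch * channels * side * side * bytes_per_element / 2**20


for side in (1536, 768, 512):
    print(f"{side}x{side}: {activation_mib(side):.0f} MiB")
# 1536x1536: 144 MiB
# 768x768: 36 MiB
# 512x512: 16 MiB
```

Halving the side length cuts the size of each such tensor to a quarter, because memory scales with width times height.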