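For several days I have been trying to train an object detection model with TuriCreate on a GPU in Google Colab.
According to the TuriCreate repository, using the GPU during training requires following these instructions: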
https://github.com/apple/turicreate/blob/main/LinuxGPU.md
However, every time I start training, the shell prints the following before training begins:
"Using CPU to create model."
My Colab notebook is structured as follows:
Setting up the CUDA environment
!wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-ubuntu1804.pin
!sudo mv cuda-ubuntu1804.pin /etc/apt/preferences.d/cuda-repository-pin-600
!sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub
!sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/ /"
!sudo apt-get update
!wget http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb
!sudo apt install ./nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb
!sudo apt-get update
!wget https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/libnvinfer7_7.1.3-1+cuda11.0_amd64.deb
!sudo apt install ./libnvinfer7_7.1.3-1+cuda11.0_amd64.deb
!sudo apt-get update
# Install development and runtime libraries (~4GB)
!sudo apt-get install --no-install-recommends \
    cuda-11-0 \
    libcudnn8=8.0.4.30-1+cuda11.0 \
    libcudnn8-dev=8.0.4.30-1+cuda11.0
# Install TensorRT. Requires that libcudnn8 is installed above.
!sudo apt-get install -y --no-install-recommends libnvinfer7=7.1.3-1+cuda11.0 \
    libnvinfer-dev=7.1.3-1+cuda11.0 \
    libnvinfer-plugin7=7.1.3-1+cuda11.0
Checking the installation with nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   33C    P8    27W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Installing dependencies
!pip install turicreate
!pip uninstall -y tensorflow
!pip install tensorflow-gpu
Setting the bash environment variable
!echo export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH >> ~/.bashrc
Training
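import turicreate as tc

# Ask TuriCreate to use all available GPUs (-1 means "all")
tc.config.set_num_gpus(-1)
# train_sf and valid_sf are the training and validation SFrames
model = tc.object_detector.create(train_sf)
scores = model.evaluate(valid_sf)
print(scores['mean_average_precision'])
model.export_coreml('model.mlmodel')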
This is the output
TuriCreate currently only supports using one GPU. Setting 'num_gpus' to 1.
Using 'image' as feature column
Using 'annotations' as annotations column
Using CPU to create model.
Setting 'batch_size' to 32
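I can't figure out what I'm missing.
Answer:
I managed to solve this: the problem was caused by the TensorFlow version preinstalled on the Colab machine. Removing it and installing tensorflow 2.4.0 together with turicreate: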
!pip uninstall -y tensorflow
!pip uninstall -y tensorflow-gpu
!pip install turicreate
!pip install tensorflow==2.4.0
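After reinstalling, it may be worth confirming which TensorFlow build is actually active and whether it can see the GPU before starting a long training run. The following is only a rough sketch, not part of the original fix: the helper names `installed_version` and `visible_gpus` are made up for illustration, and it assumes Python 3.8+ (for `importlib.metadata`) and a TensorFlow 2.x install (for `tf.config.list_physical_devices`).

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent.

    Hypothetical helper for illustration; not part of TuriCreate or TensorFlow.
    """
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def visible_gpus():
    """Return the GPU devices TensorFlow reports, or None if TensorFlow cannot be imported.

    Hypothetical helper; assumes the TensorFlow 2.x API.
    """
    try:
        import tensorflow as tf
    except ImportError:
        return None
    return tf.config.list_physical_devices("GPU")


if __name__ == "__main__":
    # Show which of the relevant packages are installed, and at what version.
    for name in ("tensorflow", "tensorflow-gpu", "turicreate"):
        print(name, installed_version(name))
    # An empty list here would suggest TensorFlow itself is not seeing the GPU.
    print("GPUs visible to TensorFlow:", visible_gpus())
```

If the GPU list comes back empty, the "Using CPU to create model." message is probably a TensorFlow/CUDA mismatch rather than a TuriCreate setting. Restarting the Colab runtime after changing packages is likely needed so the new version is the one that gets imported.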