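I have a small working example of the TensorFlow Object Detection API that runs well locally. Everything looks great. My goal is to use their scripts to run it on Google Machine Learning Engine, a platform I have used extensively in the past. I am following these docs.
Declaring some relevant variables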
declare PROJECT=$(gcloud config list project --format "value(core.project)")
declare BUCKET="gs://${PROJECT}-ml"
declare MODEL_NAME="DeepMeerkatDetection"
declare FOLDER="${BUCKET}/${MODEL_NAME}"
declare JOB_ID="${MODEL_NAME}_$(date +%Y%m%d_%H%M%S)"
declare TRAIN_DIR="${FOLDER}/${JOB_ID}"
declare EVAL_DIR="${BUCKET}/${MODEL_NAME}/${JOB_ID}_eval"
declare PIPELINE_CONFIG_PATH="${FOLDER}/faster_rcnn_inception_resnet_v2_atrous_coco_cloud.config"
declare PIPELINE_YAML="/Users/Ben/Documents/DeepMeerkat/training/Detection/cloud.yml"
My YAML file looks like this
trainingInput:
  runtimeVersion: "1.0"
  scaleTier: CUSTOM
  masterType: standard_gpu
  workerCount: 5
  workerType: standard_gpu
  parameterServerCount: 3
  parameterServerType: standard
The relevant paths are set in the config, e.g.
fine_tune_checkpoint: "gs://api-project-773889352370-ml/DeepMeerkatDetection/checkpoint/faster_rcnn_inception_resnet_v2_atrous_coco_11_06_2017/model.ckpt"
I packaged object detection and slim using setup.py
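One way to rule out a packaging problem before submitting is to list the contents of the source tarball and confirm that the module named in `--module-name` is actually in it. The sketch below is only an illustration: `tarball_has` is a hypothetical helper, and the check assumes the tarball was built from the directory that contains `dist/`.

```shell
# tarball_has TARBALL PATH: succeeds if PATH appears in the listing
# of the gzipped tarball (fixed-string match, not a regex).
tarball_has() {
  tar -tzf "$1" | grep -F -- "$2" > /dev/null
}

# Only run the check when the package exists in the current directory.
pkg="dist/object_detection-0.1.tar.gz"
if [ -f "$pkg" ]; then
  if tarball_has "$pkg" "object_detection/train.py"; then
    echo "train.py found in $pkg"
  else
    echo "train.py MISSING from $pkg"
  fi
fi
```

If the module is present, a missing-file packaging error becomes less likely, although this does not check third-party dependencies.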
Running
gcloud ml-engine jobs submit training "${JOB_ID}_train" \
    --job-dir=${TRAIN_DIR} \
    --packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz \
    --module-name object_detection.train \
    --region us-central1 \
    --config ${PIPELINE_YAML} \
    -- \
    --train_dir=${TRAIN_DIR} \
    --pipeline_config_path= ${PIPELINE_CONFIG_PATH}
produces a TensorFlow (import?) error. It is somewhat cryptic
insertId: "1inuq6gg27fxnkc"
logName: "projects/api-project-773889352370/logs/ml.googleapis.com%2FDeepMeerkatDetection_20171017_141321_train"
receiveTimestamp: "2017-10-17T21:38:34.435293164Z"
resource: {…}
severity: "ERROR"
textPayload: "The replica ps 0 exited with a non-zero status of 1. Termination reason: Error.
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 198, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 44, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 145, in main
    model_config, train_config, input_config = get_configs_from_multiple_files()
  File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 127, in get_configs_from_multiple_files
    text_format.Merge(f.read(), train_config)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/lib/io/file_io.py", line 112, in read
    return pywrap_tensorflow.ReadFromStream(self._read_buf, length, status)
  File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
    self.gen.next()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
FailedPreconditionError: .
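The traceback passes through get_configs_from_multiple_files, so it may be worth checking exactly which arguments reach train.py. The assumption here, not confirmed by anything above, is that train.py only takes that branch when pipeline_config_path arrives empty. The sketch below uses a hypothetical `show_args` helper and a placeholder path to show how the shell splits a flag written with and without a space after the `=`.

```shell
# show_args prints each argument it receives on its own line, in brackets,
# so empty values and stray separate words become visible.
show_args() {
  for arg in "$@"; do
    printf '[%s]\n' "$arg"
  done
}

# Hypothetical stand-in for the real config path.
PIPELINE_CONFIG_PATH="gs://example-bucket/model/pipeline.config"

# With a space after "=", the shell passes two words: a flag with an
# empty value, followed by a separate positional argument.
show_args --pipeline_config_path= ${PIPELINE_CONFIG_PATH}

# Without the space, the flag and its value stay in a single word.
show_args --pipeline_config_path=${PIPELINE_CONFIG_PATH}
```

The first call prints two bracketed lines and the second prints one, which makes it easy to see whether a flag is being submitted with an empty value.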
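I have seen this error in other questions about prediction on Machine Learning Engine, which suggests that it may (?) not be directly related to the object detection code. It feels more like something was packaged incorrectly, perhaps a missing dependency? I have updated my gcloud to the latest version.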
Bens-MacBook-Pro:research ben$ gcloud --version
Google Cloud SDK 175.0.0
bq 2.0.27
core 2017.10.09
gcloud
gsutil 4.27
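It is hard to see how this relates to the problem described here:
FailedPreconditionError when running the TF Object Detection API with own model
Why would the code need to be initialized differently in the cloud?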
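Update #1.
Interestingly, eval.py runs fine, so this cannot be a problem with the path to the config file, or with anything that train.py and eval.py have in common. eval.py waits patiently for model checkpoints to be created.
Another idea is that the checkpoint was somehow corrupted during upload. We can test this by bypassing the checkpoint and training from scratch.
In the .config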
from_detection_checkpoint: false
This produces the same FailedPreconditionError, so it cannot be the model.
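The upload-corruption idea can also be checked directly, independent of training, by downloading the uploaded file again and comparing it with the local original. The sketch below is a minimal illustration: `same_checksum` is a hypothetical helper, and the bucket path in the commented usage is a placeholder, not a real location.

```shell
# same_checksum FILE_A FILE_B: succeeds when both files have the same
# CRC and byte count as reported by the POSIX cksum utility.
same_checksum() {
  [ "$(cksum < "$1")" = "$(cksum < "$2")" ]
}

# Hypothetical usage: fetch the uploaded copy and compare it with the
# local file it came from.
#   gsutil cp "gs://<bucket>/<path>/model.ckpt.index" /tmp/roundtrip.index
#   same_checksum model.ckpt.index /tmp/roundtrip.index && echo "identical"
```

A matching checksum for every checkpoint file would rule out corruption in transit without having to start a training job.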
Answer: