I ran into an out-of-memory problem when I used the following configuration (config.yaml):
trainingInput:
  scaleTier: CUSTOM
  masterType: large_model
  workerType: complex_model_m
  parameterServerType: large_model
  workerCount: 10
  parameterServerCount: 10
I was following Google's "criteo_tft" tutorial:
https://github.com/GoogleCloudPlatform/cloudml-samples/blob/master/criteo_tft/config-large.yaml
That link says they were able to train on 1TB of data! I was impressed and decided to give it a try!
My dataset is categorical, so one-hot encoding it produces a fairly large matrix (a 2D numpy array of size 520000 x 4000). I can train on this dataset on my local machine with 32GB of RAM, but I cannot do it in the cloud!
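For a sense of scale, here is a rough back-of-the-envelope estimate of how much memory the dense one-hot matrix alone needs (the dtypes below are assumptions; any intermediate copies made during training come on top of this):

import numpy as np

# Rough size of the dense 520000 x 4000 one-hot matrix, before any copies.
rows, cols = 520000, 4000
for dtype in (np.float32, np.float64):
    gib = rows * cols * np.dtype(dtype).itemsize / 2.0**30
    print("%s: ~%.1f GiB" % (np.dtype(dtype).name, gib))
# float32: ~7.7 GiB, float64: ~15.5 GiB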
Here is my error message:
ERROR 2017-12-18 12:57:37 +1100 worker-replica-1 Using TensorFlow backend.
ERROR 2017-12-18 12:57:37 +1100 worker-replica-4 Using TensorFlow backend.
INFO 2017-12-18 12:57:37 +1100 worker-replica-0 Running command: python -m trainer.task --train-file gs://my_bucket/my_training_file.csv --job-dir gs://my_bucket/my_bucket_20171218_125645
ERROR 2017-12-18 12:57:38 +1100 worker-replica-2 Using TensorFlow backend.
ERROR 2017-12-18 12:57:40 +1100 worker-replica-0 Using TensorFlow backend.
ERROR 2017-12-18 12:57:53 +1100 worker-replica-3 Command '['python', '-m', u'trainer.task', u'--train-file', u'gs://my_bucket/my_training_file.csv', '--job-dir', u'gs://my_bucket/my_bucket_20171218_125645']' returned non-zero exit status -9
INFO 2017-12-18 12:57:53 +1100 worker-replica-3 Module completed; cleaning up.
INFO 2017-12-18 12:57:53 +1100 worker-replica-3 Clean up finished.
ERROR 2017-12-18 12:57:56 +1100 worker-replica-4 Command '['python', '-m', u'trainer.task', u'--train-file', u'gs://my_bucket/my_training_file.csv', '--job-dir', u'gs://my_bucket/my_bucket_20171218_125645']' returned non-zero exit status -9
INFO 2017-12-18 12:57:56 +1100 worker-replica-4 Module completed; cleaning up.
INFO 2017-12-18 12:57:56 +1100 worker-replica-4 Clean up finished.
ERROR 2017-12-18 12:57:58 +1100 worker-replica-2 Command '['python', '-m', u'trainer.task', u'--train-file', u'gs://my_bucket/my_training_file.csv', '--job-dir', u'gs://my_bucket/my_bucket_20171218_125645']' returned non-zero exit status -9
INFO 2017-12-18 12:57:58 +1100 worker-replica-2 Module completed; cleaning up.
INFO 2017-12-18 12:57:58 +1100 worker-replica-2 Clean up finished.
ERROR 2017-12-18 12:57:59 +1100 worker-replica-1 Command '['python', '-m', u'trainer.task', u'--train-file', u'gs://my_bucket/my_training_file.csv', '--job-dir', u'gs://my_bucket/my_bucket_20171218_125645']' returned non-zero exit status -9
INFO 2017-12-18 12:57:59 +1100 worker-replica-1 Module completed; cleaning up.
INFO 2017-12-18 12:57:59 +1100 worker-replica-1 Clean up finished.
ERROR 2017-12-18 12:58:01 +1100 worker-replica-0 Command '['python', '-m', u'trainer.task', u'--train-file', u'gs://my_bucket/my_training_file.csv', '--job-dir', u'gs://my_bucket/my_bucket_20171218_125645']' returned non-zero exit status -9
INFO 2017-12-18 12:58:01 +1100 worker-replica-0 Module completed; cleaning up.
INFO 2017-12-18 12:58:01 +1100 worker-replica-0 Clean up finished.
ERROR 2017-12-18 12:58:43 +1100 service The replica worker 0 ran out-of-memory and exited with a non-zero status of 247. The replica worker 1 ran out-of-memory and exited with a non-zero status of 247. The replica worker 2 ran out-of-memory and exited with a non-zero status of 247. The replica worker 3 ran out-of-memory and exited with a non-zero status of 247. The replica worker 4 ran out-of-memory and exited with a non-zero status of 247. To find out more about why your job exited please check the logs: https://console.cloud.google.com/logs/viewer?project=a_project_id........(link to my cloud log)
INFO 2017-12-18 12:58:44 +1100 ps-replica-0 Signal 15 (SIGTERM) was caught. Terminated by service. This is normal behavior.
INFO 2017-12-18 12:58:44 +1100 ps-replica-1 Signal 15 (SIGTERM) was caught. Terminated by service. This is normal behavior.
INFO 2017-12-18 12:58:44 +1100 ps-replica-0 Module completed; cleaning up.
INFO 2017-12-18 12:58:44 +1100 ps-replica-0 Clean up finished.
INFO 2017-12-18 12:58:44 +1100 ps-replica-1 Module completed; cleaning up.
INFO 2017-12-18 12:58:44 +1100 ps-replica-1 Clean up finished.
INFO 2017-12-18 12:58:44 +1100 ps-replica-2 Signal 15 (SIGTERM) was caught. Terminated by service. This is normal behavior.
INFO 2017-12-18 12:58:44 +1100 ps-replica-2 Module completed; cleaning up.
INFO 2017-12-18 12:58:44 +1100 ps-replica-2 Clean up finished.
INFO 2017-12-18 12:58:44 +1100 ps-replica-3 Signal 15 (SIGTERM) was caught. Terminated by service. This is normal behavior.
INFO 2017-12-18 12:58:44 +1100 ps-replica-5 Signal 15 (SIGTERM) was caught. Terminated by service. This is normal behavior.
INFO 2017-12-18 12:58:44 +1100 ps-replica-3 Module completed; cleaning up.
INFO 2017-12-18 12:58:44 +1100 ps-replica-3 Clean up finished.
INFO 2017-12-18 12:58:44 +1100 ps-replica-5 Module completed; cleaning up.
INFO 2017-12-18 12:58:44 +1100 ps-replica-5 Clean up finished.
INFO 2017-12-18 12:58:44 +1100 ps-replica-4 Signal 15 (SIGTERM) was caught. Terminated by service. This is normal behavior.
INFO 2017-12-18 12:58:44 +1100 ps-replica-4 Module completed; cleaning up.
INFO 2017-12-18 12:58:44 +1100 ps-replica-4 Clean up finished.
INFO 2017-12-18 12:59:28 +1100 service Finished tearing down TensorFlow.
INFO 2017-12-18 13:00:17 +1100 service Job failed.
Please ignore the "Using TensorFlow backend." errors; I get those even when training jobs on other, smaller datasets succeed.
Can anyone explain what is causing the out-of-memory failure (exit status 247), and how I should write my config.yaml to avoid this kind of problem and train my data in the cloud?
Answer:
I have solved the problem. I needed to do a couple of things:
- Change the TensorFlow version, and in particular the way I submit the training job to the cloud.
- Switch from one-hot encoding (which creates a new column for every newly added item) to feature hashing (see the sketch after this list).
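For the second point, here is a minimal sketch of what feature hashing can look like using scikit-learn's FeatureHasher; the column names and the n_features value below are purely illustrative, not my exact pipeline:

from sklearn.feature_extraction import FeatureHasher
import pandas as pd

# Illustrative data only (hypothetical columns); the real dataset is much larger.
df = pd.DataFrame({"country": ["AU", "US", "AU"],
                   "device": ["mobile", "desktop", "tablet"]})

# Hash "column=value" tokens into a fixed number of columns. Unlike one-hot
# encoding, the width stays at n_features no matter how many distinct
# categories appear; occasional hash collisions are the trade-off.
hasher = FeatureHasher(n_features=4200, input_type="string")
tokens = [["%s=%s" % (c, v) for c, v in row.items()] for _, row in df.iterrows()]
X = hasher.transform(tokens)  # scipy sparse matrix of shape (3, 4200)

The hashed matrix also stays sparse, which by itself cuts memory use considerably compared with a dense one-hot numpy array.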
It can now train a categorical dataset with 2.5 million rows and 4200 encoded columns.