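For the past few days I have been running into a problem with SageMaker's built-in Random Cut Forest algorithm.
I would like to validate the model during training, but there may be something I have not fully understood.
First, fitting with only the train channel works fine: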
    import sagemaker

    # region, role, bucket, prefix and train_location are assumed to be defined earlier.
    container = sagemaker.image_uris.retrieve("randomcutforest", region, "us-east-1")
    print(container)

    rcf = sagemaker.estimator.Estimator(
        image_uri=container,
        role=role,
        instance_count=1,
        sagemaker_session=sagemaker.Session(),
        instance_type="ml.m4.xlarge",
        data_location=f"s3://{bucket}/{prefix}/",
        output_path=f"s3://{bucket}/{prefix}/output",
    )

    rcf.set_hyperparameters(
        feature_dim=116,
        eval_metrics='precision_recall_fscore',
        num_samples_per_tree=256,
        num_trees=100,
    )

    # Unlabeled training data, sharded across instances by S3 key.
    train_data = sagemaker.inputs.TrainingInput(
        s3_data=train_location,
        content_type='text/csv;label_size=0',
        distribution='ShardedByS3Key',
    )

    rcf.fit({'train': train_data})
    [06/28/2021 09:45:24 INFO 140226936620864] Test data is not provided.
    #metrics {"StartTime": 1624873524.6154933, "EndTime": 1624873524.6156445, "Dimensions": {"Algorithm": "RandomCutForest", "Host": "algo-1", "Operation": "training"}, "Metrics": {"setuptime": {"sum": 40.169477462768555, "count": 1, "min": 40.169477462768555, "max": 40.169477462768555}, "totaltime": {"sum": 13035.491704940796, "count": 1, "min": 13035.491704940796, "max": 13035.491704940796}}}
    2021-06-28 09:45:50 Completed - Training job completed
    ProfilerReport-1624873226: NoIssuesFound
    Training seconds: 78
    Billable seconds: 78
However, when I try to validate the model during training:
    train_data = sagemaker.inputs.TrainingInput(
        s3_data=train_location,
        content_type='text/csv;label_size=0',
        distribution='ShardedByS3Key',
    )
    val_data = sagemaker.inputs.TrainingInput(
        s3_data=val_location,
        content_type='text/csv;label_size=1',
        distribution='FullyReplicated',
    )

    rcf.fit({'train': train_data, 'validation': val_data}, wait=True)
I get the following error:
    AWS Region: us-east-1
    RoleArn: arn:aws:iam::517714493426:role/service-role/AmazonSageMaker-ExecutionRole-20210409T152960
    382416733822.dkr.ecr.us-east-1.amazonaws.com/randomcutforest:1
    2021-06-28 10:14:12 Starting - Starting the training job...
    2021-06-28 10:14:14 Starting - Launching requested ML instances
    ProfilerReport-1624875252: InProgress
    ......
    2021-06-28 10:15:27 Starting - Preparing the instances for training.........
    2021-06-28 10:17:07 Downloading - Downloading input data...
    2021-06-28 10:17:27 Training - Downloading the training image..
    Docker entrypoint called with argument(s): train
    Running default environment configuration script
    [06/28/2021 10:17:53 INFO 140648505521984] Reading default configuration from /opt/amazon/lib/python3.7/site-packages/algorithm/resources/default-conf.json: {'num_samples_per_tree': 256, 'num_trees': 100, 'force_dense': 'true', 'eval_metrics': ['accuracy', 'precision_recall_fscore'], 'epochs': 1, 'mini_batch_size': 1000, '_log_level': 'info', '_kvstore': 'dist_async', '_num_kv_servers': 'auto', '_num_gpus': 'auto', '_tuning_objective_metric': '', '_ftp_port': 8999}
    [06/28/2021 10:17:53 INFO 140648505521984] Merging with provided configuration from /opt/ml/input/config/hyperparameters.json: {'num_trees': '100', 'num_samples_per_tree': '256', 'feature_dim': '116', 'eval_metrics': 'precision_recall_fscore'}
    [06/28/2021 10:17:53 INFO 140648505521984] Final configuration: {'num_samples_per_tree': '256', 'num_trees': '100', 'force_dense': 'true', 'eval_metrics': 'precision_recall_fscore', 'epochs': 1, 'mini_batch_size': 1000, '_log_level': 'info', '_kvstore': 'dist_async', '_num_kv_servers': 'auto', '_num_gpus': 'auto', '_tuning_objective_metric': '', '_ftp_port': 8999, 'feature_dim': '116'}
    [06/28/2021 10:17:53 ERROR 140648505521984] Customer Error: Unable to initialize the algorithm. Failed to validate input data configuration. (caused by ValidationError)

    Caused by: Additional properties are not allowed ('validation' was unexpected)

    Failed validating 'additionalProperties' in schema:
        {'$schema': 'http://json-schema.org/draft-04/schema#', 'additionalProperties': False, 'definitions': {'data_channel_replicated': {'properties': {'ContentType': {'type': 'string'}, 'RecordWrapperType': {'$ref': '#/definitions/record_wrapper_type'}, 'S3DistributionType': {'$ref': '#/definitions/s3_replicated_type'}, 'TrainingInputMode': {'$ref': '#/definitions/training_input_mode'}}, 'type': 'object'}, 'data_channel_sharded': {'properties': {'ContentType': {'type': 'string'}, 'RecordWrapperType': {'$ref': '#/definitions/record_wrapper_type'}, 'S3DistributionType': {'$ref': '#/definitions/s3_sharded_type'}, 'TrainingInputMode': {'$ref': '#/definitions/training_input_mode'}}, 'type': 'object'}, 'record_wrapper_type': {'enum': ['None', 'Recordio'], 'type': 'string'}, 's3_replicated_type': {'enum': ['FullyReplicated'], 'type': 'string'}, 's3_sharded_type': {'enum': ['ShardedByS3Key'], 'type': 'string'}, 'training_input_mode': {'enum': ['File', 'Pipe'], 'type': 'string'}}, 'properties': {'state': {'$ref': '#/definitions/data_channel'}, 'test': {'$ref': '#/definitions/data_channel_replicated'}, 'train': {'$ref': '#/definitions/data_channel_sharded'}}, 'required': ['train'], 'type': 'object'}

    On instance:
        {'train': {'ContentType': 'text/csv;label_size=0', 'RecordWrapperType': 'None', 'S3DistributionType': 'ShardedByS3Key', 'TrainingInputMode': 'File'}, 'validation': {'ContentType': 'text/csv;label_size=1', 'RecordWrapperType': 'None', 'S3DistributionType': 'FullyReplicated', 'TrainingInputMode': 'File'}}

    2021-06-28 10:18:10 Uploading - Uploading generated training model
    2021-06-28 10:18:10 Failed - Training job failed
    ProfilerReport-1624875252: Stopping
    ---------------------------------------------------------------------------
    UnexpectedStatusException                 Traceback (most recent call last)
    <ipython-input-34-c624ace00c69> in <module>
         33
         34
    ---> 35 rcf.fit({'train': train_data, 'validation': val_data}, wait=True)

    ~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/estimator.py in fit(self, inputs, wait, logs, job_name, experiment_config)
        680         self.jobs.append(self.latest_training_job)
        681         if wait:
    --> 682             self.latest_training_job.wait(logs=logs)
        683
        684     def _compilation_job_name(self):

    ~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/estimator.py in wait(self, logs)
       1623         # If logs are requested, call logs_for_jobs.
       1624         if logs != "None":
    -> 1625             self.sagemaker_session.logs_for_job(self.job_name, wait=True, log_type=logs)
       1626         else:
       1627             self.sagemaker_session.wait_for_job(self.job_name)

    ~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/session.py in logs_for_job(self, job_name, wait, poll, log_type)
       3679
       3680         if wait:
    -> 3681             self._check_job_status(job_name, description, "TrainingJobStatus")
       3682             if dot:
       3683                 print()

    ~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/session.py in _check_job_status(self, job, desc, status_key_name)
       3243                 ),
       3244                 allowed_statuses=["Completed", "Stopped"],
    -> 3245                 actual_status=status,
       3246             )
       3247

    UnexpectedStatusException: Error for Training job randomcutforest-2021-06-28-10-14-12-783: Failed. Reason: ClientError: Unable to initialize the algorithm. Failed to validate input data configuration. (caused by ValidationError)

    Caused by: Additional properties are not allowed ('validation' was unexpected)

    Failed validating 'additionalProperties' in schema:
        {'$schema': 'http://json-schema.org/draft-04/schema#', 'additionalProperties': False, 'definitions': {'data_channel_replicated': {'properties': {'ContentType': {'type': 'string'}, 'RecordWrapperType': {'$ref': '#/definitions/record_wrapper_type'}, 'S3DistributionType': {'$ref': '#/definitions/s3_replicated_type'}, 'TrainingInputMode': {'$ref': '#/definitions/training_input_mode'}}, 'type': 'object'}, 'data_channel_sharded': {'properties': {'ContentType': {'type': 'string'},
Can anyone help me set up validation during training correctly? That would be the best outcome I could hope for. :-D
Kind regards, Christina
Answer:
I found the error: the channel has to be named 'test' rather than 'validation'. With that change it works:
    rcf.fit({'train': train_data, 'test': test_data}, wait=True)
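The schema printed in the error message lists only 'train', 'test' and 'state' as top-level properties, marks 'train' as required, and sets 'additionalProperties' to False, which is why any other channel name is rejected. A minimal sketch of a local pre-check that catches a wrong channel name before a training job is launched could look like this. The helper name check_rcf_channels and its two constants are hypothetical, not part of the SageMaker SDK, and the allowed names are simply copied from that schema:

```python
# Channel names copied from the JSON schema shown in the error message.
# These constants and the helper below are illustrative, not SageMaker SDK APIs.
ALLOWED_CHANNELS = {"train", "test", "state"}
REQUIRED_CHANNELS = {"train"}


def check_rcf_channels(channels):
    """Return `channels` unchanged if its keys satisfy the schema's channel rules.

    `channels` is any dict keyed by channel name, such as the dict passed to fit().
    Raises ValueError when a required channel is missing or a name is not allowed.
    """
    names = set(channels)
    missing = REQUIRED_CHANNELS - names
    unexpected = names - ALLOWED_CHANNELS
    if missing:
        raise ValueError(f"missing required channel(s): {sorted(missing)}")
    if unexpected:
        raise ValueError(
            f"unexpected channel(s): {sorted(unexpected)}; "
            f"allowed: {sorted(ALLOWED_CHANNELS)}"
        )
    return channels
```

With this in place, rcf.fit(check_rcf_channels({'train': train_data, 'test': test_data}), wait=True) would pass the check, while a dict containing a 'validation' key would raise a ValueError locally within milliseconds instead of failing a few minutes into the training job.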