Add Instructions for Contributing to the project #85

Open
tlelson opened this issue Oct 9, 2018 · 3 comments

@tlelson

tlelson commented Oct 9, 2018

I am trying to get to the bottom of a problem (#413) that causes my deployed TensorFlow model to fail.

The model is simple and deploys to GCP MLE with the basic instructions. The serving function that errors out on SageMaker works fine on MLE.

The problem seems to be in the way the SageMaker container processes the input.


As such, I have started to debug locally, but I am guessing at how to do that properly, and I am currently unsure how the local SageMaker container assumes the role passed to the TensorFlow constructor.

Currently, I am building the latest sagemaker-tensorflow-container image (v1.10.0) and calling it from a local notebook instance using the MNIST example provided by amazon-sagemaker-examples:

import os

from sagemaker.tensorflow import TensorFlow

# `role` is assumed to be defined earlier in the notebook (the execution role for the job).
mnist_estimator = TensorFlow(entry_point='mnist.py',
                             role=role,
                             framework_version='1.10.0',
                             training_steps=10,
                             evaluation_steps=10,
                             train_instance_count=2,
                             train_instance_type='local',  # local mode
                             image_name='my-sm-tensorflow:1.10.0-cpu-py2',  # locally built image
                             )

# mnist_estimator.fit(inputs)
# Train from local files instead of S3 inputs.
local_inputs = 'file://{}/data/'.format(os.getcwd())
mnist_estimator.fit(local_inputs)

However, the local container fails because it cannot get an object from S3:

INFO:sagemaker:Creating training-job with name: my-sm-tensorflow-2018-10-08-05-34-16-185
Creating tmp6pytpo_algo-2-GGF0S_1 ...
Creating tmp6pytpo_algo-1-GGF0S_1 ...
Attaching to tmp6pytpo_algo-1-GGF0S_1, tmp6pytpo_algo-2-GGF0S_1
algo-1-GGF0S_1  | 2018-10-08 05:34:25,817 INFO - root - running container entrypoint
algo-1-GGF0S_1  | 2018-10-08 05:34:25,818 INFO - root - starting train task
algo-1-GGF0S_1  | 2018-10-08 05:34:25,841 INFO - container_support.training - Training starting
algo-2-GGF0S_1  | 2018-10-08 05:34:26,845 INFO - root - running container entrypoint
algo-2-GGF0S_1  | 2018-10-08 05:34:26,846 INFO - root - starting train task
algo-2-GGF0S_1  | 2018-10-08 05:34:26,873 INFO - container_support.training - Training starting
algo-1-GGF0S_1  | 2018-10-08 05:34:26,974 INFO - botocore.credentials - Found credentials in shared credentials file: ~/.aws/credentials
algo-1-GGF0S_1  | Downloading s3://sagemaker-ap-southeast-2-167464700695/my-sm-tensorflow-2018-10-08-05-34-16-185/source/sourcedir.tar.gz to /tmp/script.tar.gz
algo-1-GGF0S_1  | 2018-10-08 05:34:27,433 ERROR - container_support.training - uncaught exception during training: An error occurred (403) when calling the HeadObject operation: Forbidden
algo-1-GGF0S_1  | Traceback (most recent call last):
algo-1-GGF0S_1  |   File "/usr/local/lib/python2.7/dist-packages/container_support/training.py", line 36, in start
algo-1-GGF0S_1  |     fw.train()
algo-1-GGF0S_1  |   File "/usr/local/lib/python2.7/dist-packages/tf_container/train_entry_point.py", line 140, in train
algo-1-GGF0S_1  |     env.download_user_module()
algo-1-GGF0S_1  |   File "/usr/local/lib/python2.7/dist-packages/container_support/environment.py", line 89, in download_user_module
algo-1-GGF0S_1  |     cs.download_s3_resource(self.user_script_archive, tmp)
algo-1-GGF0S_1  |   File "/usr/local/lib/python2.7/dist-packages/container_support/utils.py", line 41, in download_s3_resource
algo-1-GGF0S_1  |     script_bucket.download_file(script_key_name, target)
algo-1-GGF0S_1  |   File "/usr/local/lib/python2.7/dist-packages/boto3/s3/inject.py", line 246, in bucket_download_file
algo-1-GGF0S_1  |     ExtraArgs=ExtraArgs, Callback=Callback, Config=Config)
algo-1-GGF0S_1  |   File "/usr/local/lib/python2.7/dist-packages/boto3/s3/inject.py", line 172, in download_file
algo-1-GGF0S_1  |     extra_args=ExtraArgs, callback=Callback)
algo-1-GGF0S_1  |   File "/usr/local/lib/python2.7/dist-packages/boto3/s3/transfer.py", line 307, in download_file
algo-1-GGF0S_1  |     future.result()
algo-1-GGF0S_1  |   File "/usr/local/lib/python2.7/dist-packages/s3transfer/futures.py", line 73, in result
algo-1-GGF0S_1  |     return self._coordinator.result()
algo-1-GGF0S_1  |   File "/usr/local/lib/python2.7/dist-packages/s3transfer/futures.py", line 233, in result
algo-1-GGF0S_1  |     raise self._exception
algo-1-GGF0S_1  | ClientError: An error occurred (403) when calling the HeadObject operation: Forbidden
algo-1-GGF0S_1  |
algo-1-GGF0S_1  |
tmp6pytpo_algo-1-GGF0S_1 exited with code 1
Stopping tmp6pytpo_algo-2-GGF0S_1 ...
Aborting on container exit... ... done
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-7-9d694d5f5d5b> in <module>()
      4 # try local inputs
      5 local_inputs = 'file://{}/data/'.format(os.getcwd())
----> 6 mnist_estimator.fit(local_inputs)

/Users/jimbo/Code/sagemaker-python-sdk/src/sagemaker/tensorflow/estimator.pyc in fit(self, inputs, wait, logs, job_name, run_tensorboard_locally)
    248                 tensorboard.join()
    249         else:
--> 250             fit_super()
    251
    252     @classmethod

/Users/jimbo/Code/sagemaker-python-sdk/src/sagemaker/tensorflow/estimator.pyc in fit_super()
    230         """
    231         def fit_super():
--> 232             super(TensorFlow, self).fit(inputs, wait, logs, job_name)
    233
    234         if run_tensorboard_locally and wait is False:

/Users/jimbo/Code/sagemaker-python-sdk/src/sagemaker/estimator.pyc in fit(self, inputs, wait, logs, job_name)
    190         self._prepare_for_training(job_name=job_name)
    191
--> 192         self.latest_training_job = _TrainingJob.start_new(self, inputs)
    193         if wait:
    194             self.latest_training_job.wait(logs=logs)

/Users/jimbo/Code/sagemaker-python-sdk/src/sagemaker/estimator.pyc in start_new(cls, estimator, inputs)
    432                                           resource_config=config['resource_config'], vpc_config=config['vpc_config'],
    433                                           hyperparameters=hyperparameters, stop_condition=config['stop_condition'],
--> 434                                           tags=estimator.tags)
    435
    436         return cls(estimator.sagemaker_session, estimator._current_job_name)

/Users/jimbo/Code/sagemaker-python-sdk/src/sagemaker/session.pyc in train(self, image, input_mode, input_config, role, job_name, output_config, resource_config, vpc_config, hyperparameters, stop_condition, tags)
    277         LOGGER.info('Creating training-job with name: {}'.format(job_name))
    278         LOGGER.debug('train request: {}'.format(json.dumps(train_request, indent=4)))
--> 279         self.sagemaker_client.create_training_job(**train_request)
    280
    281     def tune(self, job_name, strategy, objective_type, objective_metric_name,

/Users/jimbo/Code/sagemaker-python-sdk/src/sagemaker/local/local_session.pyc in create_training_job(self, TrainingJobName, AlgorithmSpecification, InputDataConfig, OutputDataConfig, ResourceConfig, **kwargs)
     73         training_job = _LocalTrainingJob(container)
     74         hyperparameters = kwargs['HyperParameters'] if 'HyperParameters' in kwargs else {}
---> 75         training_job.start(InputDataConfig, hyperparameters)
     76
     77         LocalSagemakerClient._training_jobs[TrainingJobName] = training_job

/Users/jimbo/Code/sagemaker-python-sdk/src/sagemaker/local/entities.pyc in start(self, input_data_config, hyperparameters)
     58         self.state = self._TRAINING
     59
---> 60         self.model_artifacts = self.container.train(input_data_config, hyperparameters)
     61         self.end = datetime.datetime.now()
     62         self.state = self._COMPLETED

/Users/jimbo/Code/sagemaker-python-sdk/src/sagemaker/local/image.pyc in train(self, input_data_config, hyperparameters)
    124             # which contains the exit code and append the command line to it.
    125             msg = "Failed to run: %s, %s" % (compose_command, str(e))
--> 126             raise RuntimeError(msg)
    127
    128         s3_artifacts = self.retrieve_artifacts(compose_data)

RuntimeError: Failed to run: ['docker-compose', '-f', '/private/var/folders/1x/gyr4jt_s3jqc2c88vy74btnm0000gn/T/tmp6PyTpo/docker-compose.yaml', 'up', '--build', '--abort-on-container-exit'], Process exited with code: 1

I have verified that the role is able to copy the object, which leads me to suppose that the container does not assume the role properly.

I wonder how it is meant to assume the role?


The instructions to build the container image locally are clear, thank you for that. I would like to see something in the README.md or CONTRIBUTING.md that shows the recommended process for developing the container and calling the built image locally.

@ChoiByungWook
Contributor

Do you have docker-compose installed?

I believe the AmazonSageMakerFullAccess policy has, by default, an S3 condition that requires the bucket name to contain the word sagemaker.
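
As a rough illustration of that naming condition, the sketch below splits an S3 URI into bucket and key and checks whether the bucket name contains "sagemaker". The helper names are hypothetical, and the case-insensitive substring match is only an approximation of the condition described above, not a copy of the policy itself.

```python
def split_s3_uri(uri):
    """Split 's3://bucket/key' into a (bucket, key) tuple."""
    prefix = "s3://"
    if not uri.startswith(prefix):
        raise ValueError("not an S3 URI: {!r}".format(uri))
    bucket, _, key = uri[len(prefix):].partition("/")
    if not bucket:
        raise ValueError("missing bucket name: {!r}".format(uri))
    return bucket, key


def bucket_mentions_sagemaker(uri):
    """Approximate the naming condition: 'sagemaker' appears in the bucket name."""
    bucket, _ = split_s3_uri(uri)
    return "sagemaker" in bucket.lower()
```

Applied to the URI in the log above, the bucket `sagemaker-ap-southeast-2-167464700695` satisfies this check, so the bucket name alone would not seem to explain the 403.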

In addition, for local mode, I believe that since you have your AWS credentials set, they should be passed properly to the container.

https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/local/image.py#L631

Could you perhaps try exporting your credentials as environment variables?
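
One way to try this is sketched below: assume the execution role with STS and export the temporary credentials under the standard AWS environment variable names before calling `fit`. The function names and the session name are hypothetical, and whether the local-mode container actually picks these variables up depends on the SDK version in use, so treat this as an experiment rather than a confirmed fix.

```python
import os


def credentials_to_env(credentials):
    """Map an STS 'Credentials' dict to the standard AWS environment variable names."""
    return {
        "AWS_ACCESS_KEY_ID": credentials["AccessKeyId"],
        "AWS_SECRET_ACCESS_KEY": credentials["SecretAccessKey"],
        "AWS_SESSION_TOKEN": credentials["SessionToken"],
    }


def export_assumed_role(role_arn, session_name="local-mode-debug"):
    """Assume role_arn via STS and export its temporary credentials into os.environ."""
    import boto3  # imported lazily so credentials_to_env works without boto3 installed

    response = boto3.client("sts").assume_role(
        RoleArn=role_arn, RoleSessionName=session_name
    )
    os.environ.update(credentials_to_env(response["Credentials"]))
```

Calling `export_assumed_role(role)` in the notebook before `mnist_estimator.fit(local_inputs)` would make the temporary credentials visible to any process started from that notebook.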

@tlelson
Author

tlelson commented Oct 10, 2018

Hi @ChoiByungWook, I appreciate your help.

The problem is that inside the container, the ExecutionRole passed to the TensorFlow constructor is not being assumed. Let me convince you.

When debugging these issues I assume the role from my local machine and run under that role.

You can see from the stacktrace above that I have explicitly built the docker image with a credentials file of a user that is able to assume the ExecutionRole:

algo-1-GGF0S_1  | 2018-10-08 05:34:26,974 INFO - botocore.credentials - Found credentials in shared credentials file: ~/.aws/credentials

After that point the HeadObject API call fails (403).

algo-1-GGF0S_1  | ClientError: An error occurred (403) when calling the HeadObject operation: Forbidden

Now I use the very same credentials file to authenticate and assume the intended role.

Observe the following executed with the awscli from my own machine:

(general) tim@tim:.aws ❯ aws  sts get-caller-identity
{
    "UserId": "AROAIL7MXNDJIECXXGR34:botocore-session-1539129598",
    "Account": "167464700695",
    "Arn": "arn:aws:sts::167464700695:assumed-role/AmazonSageMaker-ExecutionRole-20180907T092630/botocore-session-1539129598"
}
(general) tim@tim:.aws ❯ aws s3api head-object --bucket sagemaker-ap-southeast-2-167464700695 --key tims-sm-tensorflow-2018-10-08-05-34-16-185/source/sourcedir.tar.gz
{
    "AcceptRanges": "bytes",
    "LastModified": "Mon, 08 Oct 2018 05:34:20 GMT",
    "ContentLength": 1495,
    "ETag": "\"609b922764cac00005bbe2d6dfa17475\"",
    "ContentType": "binary/octet-stream",
    "Metadata": {}
}

I hope this makes it clear that the role has permissions to HeadObject.
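
The same two checks can also be run from Python, for example inside the container where the failure actually occurs. This is a minimal sketch assuming boto3 is importable there (the traceback above suggests it is); the function names are hypothetical.

```python
def assumed_role_name(arn):
    """Return the role name if arn is an STS assumed-role ARN, otherwise None."""
    # Expected shape: arn:aws:sts::<account>:assumed-role/<role-name>/<session-name>
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[2] != "sts":
        return None
    resource = parts[5].split("/")
    if len(resource) < 3 or resource[0] != "assumed-role":
        return None
    return resource[1]


def check_head_object(bucket, key):
    """Return (assumed role name or None, whether HeadObject succeeded)."""
    import boto3  # imported lazily so assumed_role_name works without boto3 installed
    from botocore.exceptions import ClientError

    arn = boto3.client("sts").get_caller_identity()["Arn"]
    try:
        boto3.client("s3").head_object(Bucket=bucket, Key=key)
        allowed = True
    except ClientError:
        # Any client error (403, 404, ...) is treated as "not allowed" here.
        allowed = False
    return assumed_role_name(arn), allowed
```

If the first element comes back as `None` inside the container, the process is running under the base credentials rather than the execution role, which would be consistent with the 403.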


I am sure that I am simply not running the local container properly. Perhaps there are requirements on the notebook environment that instantiates the TensorFlow object and on the role it runs under. I suspect we will discover that in time. This is why I would like to see the official recommendations for how to develop the container locally.

Once again, thanks for your help. I think SageMaker has fantastic potential and am quite keen to contribute.

@icywang86rui
Contributor

I agree that it's confusing that the role passed to the TensorFlow estimator is not actually used in the containers in local mode. As mentioned in aws/sagemaker-python-sdk#413, we will update our documentation.
