
Custom model_fn function not found when extending the PyTorch inference container #86

e13h commented Jun 22, 2021

Background

I am trying to do single-model batch transform in SageMaker to get predictions from a pre-trained model (I did not train the model on SageMaker). My end goal is to be able to run just a bit of Python code to start a batch transform job and grab the results from S3 when it's done.

import boto3
client = boto3.client("sagemaker")
client.create_transform_job(...)

# occasionally monitor the job
client.describe_transform_job(...)

# fetch results once job is finished
client = boto3.client("s3")
...

I can successfully get the results I need using Transformer.transform() in a SageMaker notebook instance (see the appendix below for code snippets), but in my project I do not want to depend on the SageMaker Python SDK. Instead, I'd rather use boto3 like in the pseudocode above.
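
For completeness, here is roughly what that boto3 flow might look like. This is only a sketch: the job name, model name, S3 URIs, instance type, and output key below are placeholders, and it assumes a SageMaker Model has already been created for the container image and model artifact.

import time

import boto3

sagemaker = boto3.client("sagemaker")

# placeholder values
job_name = "my-batch-transform-job"
model_name = "my-pytorch-model"

# start the batch transform job
sagemaker.create_transform_job(
    TransformJobName=job_name,
    ModelName=model_name,
    TransformInput={
        "DataSource": {
            "S3DataSource": {"S3DataType": "S3Prefix", "S3Uri": "s3://bucket/path/to/input"}
        },
        "ContentType": "image/png",
        "CompressionType": "None",
    },
    TransformOutput={"S3OutputPath": "s3://bucket/path/to/output", "Accept": "image/png"},
    TransformResources={"InstanceType": "ml.p2.xlarge", "InstanceCount": 1},
)

# occasionally monitor the job until it finishes
while True:
    status = sagemaker.describe_transform_job(TransformJobName=job_name)["TransformJobStatus"]
    if status in ("Completed", "Failed", "Stopped"):
        break
    time.sleep(30)

# fetch results once the job is finished
s3 = boto3.client("s3")
s3.download_file("bucket", "path/to/output/example.png.out", "example.png.out")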

The issue

I referenced this example notebook to try to extend a PyTorch inference container (see the appendix below for the Dockerfile I am using), but I can't get the same results that I get when I use the SageMaker Python SDK in a notebook instance. Instead I get this error:

Backend worker process died.
Traceback (most recent call last):
    File "/opt/conda/lib/python3.6/site-packages/ts/model_service_worker.py", line 182, in <module>
        worker.run_server()
    File "/opt/conda/lib/python3.6/site-packages/ts/model_service_worker.py", line 154, in run_server
        self.handle_connection(cl_socket)
    File "/opt/conda/lib/python3.6/site-packages/ts/model_service_worker.py", line 116, in handle_connection
        service, result, code = self.load_model(msg)
    File "/opt/conda/lib/python3.6/site-packages/ts/model_service_worker.py", line 89, in load_model
        service = model_loader.load(model_name, model_dir, handler, gpu, batch_size, envelope)
    File "/opt/conda/lib/python3.6/site-packages/ts/model_loader.py", line 110, in load
        initialize_fn(service.context)
    File "/home/model-server/tmp/models/d00cc5c716dc4e4582250bd89915b99b/handler_service.py", line 51, in initialize
        super().initialize(context)
    File "/opt/conda/lib/python3.6/site-packages/sagemaker_inference/default_handler_service.py", line 66, in initialize
        self._service.validate_and_initialize(model_dir=model_dir)
    File "/opt/conda/lib/python3.6/site-packages/sagemaker_inference/transformer.py", line 158, in validate_and_initialize
        self._model = self._model_fn(model_dir)
    File "/opt/conda/lib/python3.6/site-packages/sagemaker_pytorch_serving_container/default_pytorch_inference_handler.py", line 55, in default_model_fn
        NotImplementedError:
            Please provide a model_fn implementation.
            See documentation for model_fn at https://github.com/aws/sagemaker-python-sdk
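
For context, the entry point named by SAGEMAKER_PROGRAM (inference.py in my case) is what is supposed to provide model_fn; when the toolkit cannot import that module, it falls back to the default handler, whose default_model_fn raises the NotImplementedError shown above. A minimal sketch of the expected shape (the artifact name model.pt is just a placeholder, not my actual code):

# inference.py -- minimal handler sketch
import os

import torch


def model_fn(model_dir):
    # called once per worker; model_dir is the extracted contents of model.tar.gz
    model = torch.jit.load(os.path.join(model_dir, "model.pt"), map_location="cpu")
    model.eval()
    return model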

The problem seems to be that when the inference toolkit tries to import a customized inference.py script, it can't find it, presumably because /opt/ml/model/code is not found in sys.path.

if find_spec(user_module_name) is not None:

If I understand the code correctly, then in the snippet below (which runs before the snippet above), we are attempting to add code_dir to the Python path by prepending it to the PYTHONPATH environment variable, but that won't affect sys.path of the already-running interpreter.

# add model_dir/code to python path
code_dir_path = "{}:".format(model_dir + "/code")
if PYTHON_PATH_ENV in os.environ:
    os.environ[PYTHON_PATH_ENV] = code_dir_path + os.environ[PYTHON_PATH_ENV]
else:
    os.environ[PYTHON_PATH_ENV] = code_dir_path

I wonder if it should be like this instead:

import sys
from sagemaker_inference.environment import code_dir
...
# add model_dir/code to python path 
if code_dir not in sys.path:
    sys.path.append(code_dir)
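
As a quick sanity check of that reasoning (a standalone sketch, not the toolkit code): mutating os.environ["PYTHONPATH"] in a running interpreter does not change sys.path, so find_spec still cannot locate the user module, whereas appending to sys.path directly takes effect immediately.

import os
import sys
from importlib.util import find_spec

code_dir = "/opt/ml/model/code"  # directory containing inference.py

# PYTHONPATH is only consulted at interpreter startup, so this has no effect here
os.environ["PYTHONPATH"] = code_dir + ":" + os.environ.get("PYTHONPATH", "")
print(find_spec("inference"))  # None -- still not importable

# modifying sys.path affects the current runtime immediately
if code_dir not in sys.path:
    sys.path.append(code_dir)
print(find_spec("inference"))  # ModuleSpec(name='inference', ...)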

Appendix

Notebook cells containing code I was able to run successfully

Here's what I can get running in a SageMaker notebook instance (ml.p2.xlarge). The last cell takes about 5 minutes to run.

from sagemaker import get_execution_role
from sagemaker.pytorch.model import PyTorchModel

# fill out proper values here
path_to_model = "s3://bucket/path/to/model/model.tar.gz"

repo = "GITHUB_REPO_URL_HERE"
branch = "BRANCH_NAME_HERE"
token = "GITHUB_PAT_HERE"

path_to_code_location = "s3://bucket/path/to/code/location"
github_repo_source_dir = "relative/path/to/entry/point"

path_to_output = "s3://bucket/path/to/output"
path_to_input = "s3://bucket/path/to/input"
pytorch_model = PyTorchModel(
    image_uri="763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-inference:1.4-gpu-py36",  # the latest supported version I could get working
    model_data=path_to_model,
    git_config={
        "repo": repo,
        "branch": branch,
        "token": token,
    },
    code_location=path_to_code_location,  # must provide this so that a default bucket isn't created
    source_dir=github_repo_source_dir,
    entry_point="inference.py",
    role=get_execution_role(),
    py_version="py3",
    framework_version="1.4",  # must provide this even though we are supplying `image_uri`
)
transformer = pytorch_model.transformer(
    instance_count=1,
    instance_type="local_gpu",
    strategy="SingleRecord",
    output_path=path_to_output,
    accept="image/png",
)
transformer.transform(
    data=path_to_input,
    data_type="S3Prefix",
    content_type="image/png",
    compression_type=None,
    wait=True,
    logs=True,
)

Dockerfile for extended container

# Tutorial for extending AWS SageMaker PyTorch containers:
# https://github.com/aws/amazon-sagemaker-examples/blob/master/advanced_functionality/pytorch_extending_our_containers/pytorch_extending_our_containers.ipynb
ARG REGION=us-west-2

# SageMaker PyTorch Image
FROM 763104351884.dkr.ecr.$REGION.amazonaws.com/pytorch-inference:1.8.1-gpu-py36-cu111-ubuntu18.04

ARG CODE_DIR=/opt/ml/model/code
ENV PATH="${CODE_DIR}:${PATH}"

# /opt/ml and all subdirectories are utilized by SageMaker; we use the /code subdirectory to store our user code.
COPY /inference ${CODE_DIR}

# Used by the SageMaker PyTorch container to determine our user code directory.
ENV SAGEMAKER_SUBMIT_DIRECTORY ${CODE_DIR}

# Used by the SageMaker PyTorch container to determine our program entry point.
# For more information: https://github.com/aws/sagemaker-pytorch-container
ENV SAGEMAKER_PROGRAM inference.py
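
Given the diagnosis above, a possible workaround in the extended container (untested on my side, just an idea) would be to bake the code directory into PYTHONPATH so that it is already on sys.path when the serving workers' interpreters start:

# untested workaround idea: expose the code dir to Python at interpreter startup
ENV PYTHONPATH="${CODE_DIR}"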