Model deployment is failing with the error "The primary container for production variant AllTraffic did not pass the ping health check." #401

Open
vishwath96 opened this issue Sep 30, 2020 · 5 comments

@vishwath96
vishwath96 commented Sep 30, 2020

I'm trying to deploy a custom Word2Vec model, which I trained offline, as a SageMaker endpoint. I followed the example at https://github.com/awslabs/amazon-sagemaker-examples/tree/master/advanced_functionality/scikit_bring_your_own to create the Dockerfile and the rest of the container.

I've added the following to the Dockerfile: # ENTRYPOINT ["python3", "/usr/local/bin/predictor.py"]

Looking at the logs, I can see that this code runs and that the model loads, but the endpoint never comes up and deployment fails with the error "The primary container for production variant AllTraffic did not pass the ping health check."

Any help?
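
For context on the error above: SageMaker hosting treats a container as healthy only once it answers an HTTP GET on /ping, port 8080, with status 200. A script that loads the model but never starts a web server on that port would fail the check even though the logs show the model loading. The following is a minimal sketch using only the Python standard library, not the setup from the linked example (which uses nginx, gunicorn, and Flask). The names PingHandler, make_server, and MODEL_LOADED are hypothetical, and a real container also has to handle POST /invocations.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical flag: set this to True once your own model has loaded.
MODEL_LOADED = True


class PingHandler(BaseHTTPRequestHandler):
    """Answers the GET /ping health check with an empty body."""

    def do_GET(self):
        if self.path == "/ping":
            # 200 tells the health check the container is ready.
            status = 200 if MODEL_LOADED else 503
        else:
            status = 404
        self.send_response(status)
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, fmt, *args):
        # Silence the default per-request logging.
        pass


def make_server(port=8080):
    """Builds a server listening on all interfaces at the given port."""
    return HTTPServer(("0.0.0.0", port), PingHandler)


if __name__ == "__main__":
    make_server().serve_forever()
```

Running this inside the container and requesting http://localhost:8080/ping is a quick way to check locally whether anything is listening before deploying.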

@ajaykarpur
Contributor

Hi @vishwath96, are you able to share your logs and the full stack trace?

@jocelynbaduria
Hi, I am having the same error. I am deploying my own dlib model. The CloudWatch logs show the following.
What does it mean?

2022/06/15 21:08:37 [error] 19#19: *1 js: failed ping{
"error": "Servable not found for request: Latest(persona-id)"
}

Kindly help. Thank you.

@priyakhokher
priyakhokher commented Jul 28, 2022

@ajaykarpur I followed your notebook, which was helpful, but it fails at deployment for me too. Here's my stack trace. Any help would be appreciated; I've been blocked on this for a while. I don't understand how the model path can be read-only when training writes the .pkl file to S3 perfectly fine. However, when I try to deploy it with

from sagemaker.predictor import csv_serializer
predictor = tree.deploy(1, "ml.m4.xlarge", serializer=csv_serializer)

I run into this error.

Starting the training.
Traceback (most recent call last):
  File "/opt/ml/train", line 55, in train
    with open(os.path.join(model_path, 'decision-trees.pkl'), 'wb') as out:

OSError: [Errno 30] Read-only file system: '/opt/ml/model/decision-trees.pkl'

During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/opt/ml/train", line 72, in <module>
    train()
  File "/opt/ml/train", line 64, in train
    with open(os.path.join(output_path, 'failure'), 'w') as s:

FileNotFoundError: [Errno 2] No such file or directory: '/opt/ml/output/failure'

@ankitvirla
Hi. To resolve the FileNotFoundError, the "/opt/ml/output/" directory must exist inside the Docker container so that a file named failure can be written there.
That write is only attempted because training has already failed for some other reason.
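
As a rough illustration of that point, the failure handler could create the output directory before writing to it. This is only a sketch: write_failure is a hypothetical helper, not code from the example's train script, and it addresses only the second exception in the traceback, not the underlying read-only error.

```python
import os
import traceback


def write_failure(output_path, exc):
    """Writes the failure description, creating output_path if it is missing."""
    # Without this, open() raises FileNotFoundError when the directory is absent.
    os.makedirs(output_path, exist_ok=True)
    failure_file = os.path.join(output_path, "failure")
    with open(failure_file, "w") as f:
        f.write("Exception during training: " + str(exc) + "\n")
        f.write(traceback.format_exc())
    return failure_file
```

In the train script this would be called from the except block, for example write_failure('/opt/ml/output', e).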

@priyakhokher
@birla8319 the actual error is this line: OSError: [Errno 30] Read-only file system: '/opt/ml/model/decision-trees.pkl'
I see it in my CloudWatch logs, and it puzzles me. The model pickle files are written to S3, and I don't see any /opt/ml/output/failure results in S3 either.
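
One possible reading of the logs above, offered as an assumption rather than a confirmed diagnosis: "Starting the training." appears during deploy(), which suggests the container runs the training code when it is started for hosting. SageMaker starts the image with the argument train for training jobs and serve for endpoints, and a read-only /opt/ml/model is consistent with the hosting case. A minimal sketch of an entrypoint that dispatches on that argument follows; train, serve, and main are hypothetical placeholders, not the example's scripts.

```python
import sys


def train():
    """Placeholder: fit the model and write artifacts to /opt/ml/model."""
    return "train"


def serve():
    """Placeholder: load artifacts from /opt/ml/model and start the web server."""
    return "serve"


def main(argv):
    """Runs training or serving depending on the first container argument."""
    mode = argv[1] if len(argv) > 1 else ""
    if mode == "train":
        return train()
    if mode == "serve":
        return serve()
    raise SystemExit("unknown mode %r; expected 'train' or 'serve'" % mode)


if __name__ == "__main__":
    main(sys.argv)
```

With a layout like this, the serving path never tries to write decision-trees.pkl, so it would not hit the read-only model directory.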
