[BUG] Update Training and Serving Merlin on AWS SageMaker using latest merlin image #1039

Open
rnyak opened this issue Jul 3, 2023 · 5 comments

Comments

@rnyak
Contributor

rnyak commented Jul 3, 2023

Bug description

One of our customers is having trouble reproducing the Training and Serving Merlin on AWS SageMaker example and gets an error (the error details will be provided later).

The documentation should also be clarified: it is not clear how to generate the dataset in the Generating Dataset step without installing the Merlin libraries, using only the Merlin image.

Steps/Code to reproduce bug

The notebooks should be tested with the latest stable merlin-tensorflow image and updated if required. The example currently uses the merlin-tensorflow:22.10 image.

Expected behavior

Environment details

  • Merlin version:
  • Platform:
  • Python version:
  • PyTorch version (GPU?):
  • Tensorflow version (GPU?):

Additional context

@rnyak rnyak added bug Something isn't working P1 Priority 1 labels Jul 3, 2023
@rnyak rnyak added this to the Merlin 23.07 milestone Jul 3, 2023
@edknv
Contributor

edknv commented Jul 3, 2023

I'm working on updating the merlin-tensorflow image to 23.06 here: #1040.

After bumping the image version to 23.06, updating the processing workflow in train.py to reflect recent changes, and running the updated example on AWS, we get the following error:

Failed to transform operator <merlin.systems.dag.runtimes.triton.ops.workflow.TransformWorkflowTriton object at 0x7fe7df82a160>
RuntimeError: Failed for execute the inference request. Model '0_transformworkflowtriton' is not ready.

which doesn't tell us much about what is going wrong. I'll try to run the container locally to debug.
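For anyone else debugging this locally, a minimal sketch of one way to check whether a given model in the ensemble has loaded: poll the readiness endpoint that Triton exposes over HTTP (the KServe v2 path `/v2/models/<name>/ready`). The model name comes from the error above; the base URL `http://localhost:8000` and the helper names `model_ready_url` and `is_model_ready` are assumptions for illustration, not part of the example notebook, and the port may differ depending on how the container is started.

```python
import urllib.error
import urllib.parse
import urllib.request


def model_ready_url(model_name: str, base_url: str = "http://localhost:8000") -> str:
    """Build the readiness URL for one model (base_url is an assumed local address)."""
    safe_name = urllib.parse.quote(model_name, safe="")
    return f"{base_url.rstrip('/')}/v2/models/{safe_name}/ready"


def is_model_ready(model_name: str,
                   base_url: str = "http://localhost:8000",
                   timeout: float = 5.0) -> bool:
    """Return True only if the server answers the readiness probe with HTTP 200."""
    url = model_ready_url(model_name, base_url)
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Server unreachable, or the probe returned a non-200 status
        # (urlopen raises HTTPError, a subclass of URLError, in that case).
        return False
```

Calling `is_model_ready("0_transformworkflowtriton")` against a locally running container would then show whether the workflow model is the one that failed to load; the server log from the container start-up is still the place to look for the underlying cause.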

@rnyak
Contributor Author

rnyak commented Jul 3, 2023

Thanks, @edknv!

@wei-m-teh

@edknv is there any update on this issue? I am trying to deploy a Merlin model to SageMaker following the given example, and I am running into the same issue.

@edknv
Contributor

edknv commented Oct 26, 2023

@wei-m-teh Apologies for the delay. It's in review at the moment, but I updated #1040 with a workaround I found for making the notebook work with the latest 23.08 image.

@rnyak
Contributor Author

rnyak commented Nov 2, 2023

@wei-m-teh, could you please test this PR on your end? Thanks.
