[BUG] User cannot deploy Merlin image >=23.04 on Azure Databricks #1055

Open
rnyak opened this issue Jul 12, 2023 · 0 comments
Labels
bug Something isn't working P1 Priority 1
rnyak commented Jul 12, 2023

Bug description

The user reported the error below when trying to deploy a merlin-tensorflow image >= 23.04. They are able to deploy the merlin-tensorflow:23.02 image on Azure Databricks. One main difference between these images is the CUDA version.

Spark driver could not be reached on startup. This issue can be caused by invalid Spark configurations or malfunctioning [init scripts](https://docs.microsoft.com/azure/databricks/clusters/init-scripts#global-and-cluster-named-init-script-logs). Please refer to the Spark driver logs to troubleshoot this issue, and contact Databricks if the problem persists.

Internal error message: Spark failed to start: Could not connect to driver instance. Possible reason: network misconfiguration.

Steps/Code to reproduce bug

Expected behavior

Environment details

  • Merlin version:
  • Platform:
  • Python version:
  • PyTorch version (GPU?):
  • Tensorflow version (GPU?):

Additional context

An engineer from the RAPIDS team debugged the Spark cluster issue that this user is facing with the merlin-tensorflow:23.04 image. They converted the instructions from https://docs.databricks.com/clusters/custom-containers.html#option-2-build-your-own-docker-base into tests that we can run with container-canary:

https://github.com/NVIDIA/container-canary/blob/main/examples/databricks.yaml

Here are some quick notes on running the test:

https://gist.github.com/jacobtomlinson/73f30f5657a370e7ed2a559b0eb7123f
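
To compare the working and failing images side by side, the container-canary check could be scripted roughly as follows. This is only a sketch under several assumptions: the `canary` CLI is installed and accepts `validate --file <manifest> <image>` as described in the container-canary README, `examples/databricks.yaml` is read from a local clone of that repository, the images are pulled from `nvcr.io/nvidia/merlin/merlin-tensorflow`, and a non-zero exit code means the checks failed. The helper names are hypothetical.

```python
import shutil
import subprocess

# Assumption: path to the manifest inside a local clone of NVIDIA/container-canary.
MANIFEST = "examples/databricks.yaml"
# Assumption: NGC registry path for the Merlin TensorFlow images.
REGISTRY = "nvcr.io/nvidia/merlin/merlin-tensorflow"


def build_canary_command(tag, manifest=MANIFEST, registry=REGISTRY):
    """Return the argv list for validating one image tag (hypothetical helper)."""
    return ["canary", "validate", "--file", manifest, f"{registry}:{tag}"]


def run_checks(tags):
    """Run the validation for each tag and map tag -> True if canary exited with 0."""
    results = {}
    for tag in tags:
        completed = subprocess.run(build_canary_command(tag), check=False)
        # Assumption: canary signals failed checks through a non-zero exit code.
        results[tag] = completed.returncode == 0
    return results


if __name__ == "__main__":
    if shutil.which("canary") is None:
        raise SystemExit("canary CLI not found on PATH")
    # 23.02 is reported to work on Azure Databricks; 23.04 is reported to fail.
    for tag, ok in run_checks(["23.02", "23.04"]).items():
        print(f"merlin-tensorflow:{tag}: {'passed' if ok else 'failed'}")
```

Running both tags with the same manifest should show which Databricks container requirements, if any, the 23.04 image no longer meets.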

rnyak added the bug (Something isn't working) and P0 labels on Jul 12, 2023
rnyak added this to the Merlin 23.07 milestone on Jul 12, 2023
viswa-nvidia added the P1 (Priority 1) label and removed the P0 label on Sep 11, 2023