This repository has been archived by the owner on Feb 3, 2021. It is now read-only.

wait_until_cluster_is_ready not timing out on start task failure #621

Open
AndreiPopescuSK opened this issue Jun 28, 2018 · 1 comment

@AndreiPopescuSK

Hello,

I'm using the SDK (v0.8.0) to spin up an AZTK cluster with a custom Docker image. On one occasion I forgot to pass the Docker registry credentials, which led to all node start tasks failing.

I would expect wait_until_cluster_is_ready to handle this case, either by timing out when no master node comes up within WAIT_FOR_MASTER_TIMEOUT seconds, or by noticing that the master start task failed. Unfortunately, neither happens and cluster spin-up hangs indefinitely.

Presumably this is because this loop never terminates, as this line is always run. Maybe if the master start task fails, a master_node_id is never given to the cluster, so it gets stuck there?

Any idea if this is the case? Thank you for the help.

@jafreck
Member

jafreck commented Jun 28, 2018

You are correct: if all start tasks fail early enough, a master will never be elected (so no master_node_id will be set), and that loop will hang. I think the best solution here might be to check whether all nodes have entered StartTaskFailed and, if so, exit. Adding a timeout is another good option.
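
The two options above could be combined roughly as in the sketch below. This is not the AZTK implementation: `wait_for_master`, `MasterElectionError`, the two lookup callables, the `"starttaskfailed"` state string, and the timeout value are all placeholders chosen for illustration. In the SDK, the callables would wrap whatever calls return the cluster's master_node_id and the per-node states.

```python
import time
from typing import Callable, Optional, Sequence

# Illustrative value only; not necessarily the constant AZTK uses.
WAIT_FOR_MASTER_TIMEOUT = 60 * 20  # seconds


class MasterElectionError(Exception):
    """Hypothetical error: no master can be elected because every node failed."""


def wait_for_master(
    get_master_node_id: Callable[[], Optional[str]],
    get_node_states: Callable[[], Sequence[str]],
    timeout: float = WAIT_FOR_MASTER_TIMEOUT,
    poll_interval: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until a master node id is set, every node has failed, or time runs out."""
    deadline = clock() + timeout
    while True:
        master_node_id = get_master_node_id()
        if master_node_id is not None:
            return master_node_id

        # Exit early: if every node's start task failed, no master will be elected.
        states = get_node_states()
        if states and all(state == "starttaskfailed" for state in states):
            raise MasterElectionError("start task failed on all nodes")

        # Safety net for every other way the election can stall.
        if clock() >= deadline:
            raise TimeoutError(f"no master node elected within {timeout} seconds")

        sleep(poll_interval)
```

Checking the node states gives a fast, specific failure for the missing-credentials case, while the deadline bounds the wait for failures the state check does not cover.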

Thanks for pointing this out!

@jafreck jafreck added this to the v0.9.0 milestone Jun 28, 2018
@jafreck jafreck added the bug label Jun 28, 2018