wait_until_cluster_is_ready not timing out on start task failure #621

AndreiPopescuSK · 2018-06-28T12:51:32Z

Hello,

I'm using the SDK (v0.8.0) to spin up an AZTK cluster. I'm also using a custom docker image, and in one instance I forgot to pass the docker registry credentials, which led to all node start tasks failing.

I would expect that in this case, wait_until_cluster_is_ready would either time out after failing to bring up a master node within WAIT_FOR_MASTER_TIMEOUT seconds, or notice that the master start task failed. Unfortunately, neither happens and cluster spin-up hangs indefinitely.

Presumably this is because this loop never terminates, as this line is always run. Maybe if the master start task fails, a master_node_id is never given to the cluster, so it gets stuck there?

Any idea if this is the case? Thank you for the help.

jafreck · 2018-06-28T18:00:15Z

You are correct: if all start tasks fail early enough, a master will never be elected (so no master_node_id will be set), and that loop will hang. I think the best solution here might be to check whether all nodes have entered StartTaskFailed, and exit if so. Adding a timeout is another good option.

Thanks for pointing this out!
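The two options mentioned above (exit when every node has failed its start task, and enforce a timeout) could be combined roughly as follows. This is only a minimal sketch and not the actual AZTK code: `NodeInfo`, `ClusterInfo`, `MasterElectionError`, `wait_for_master`, and the `"starttaskfailed"` state string are hypothetical stand-ins chosen for illustration, and `get_cluster` is assumed to be any callable that returns the current cluster status.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class NodeInfo:
    """Hypothetical stand-in for a node record."""
    id: str
    state: str  # assumed values, e.g. "starting", "idle", "starttaskfailed"


@dataclass
class ClusterInfo:
    """Hypothetical stand-in for the cluster status object."""
    master_node_id: Optional[str] = None
    nodes: List[NodeInfo] = field(default_factory=list)


class MasterElectionError(Exception):
    """Raised when no master can ever be elected."""


def wait_for_master(
    get_cluster: Callable[[], ClusterInfo],
    timeout_seconds: float,
    poll_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll until a master is elected, all nodes fail, or the timeout expires."""
    deadline = clock() + timeout_seconds
    while True:
        cluster = get_cluster()

        # Normal exit: a master has been elected.
        if cluster.master_node_id is not None:
            return cluster.master_node_id

        # Early exit: every node failed its start task, so waiting is pointless.
        if cluster.nodes and all(n.state == "starttaskfailed" for n in cluster.nodes):
            raise MasterElectionError(
                "all node start tasks failed; no master can be elected")

        # Safety net: give up once the deadline has passed.
        if clock() >= deadline:
            raise TimeoutError(
                "no master elected within {} seconds".format(timeout_seconds))

        sleep(poll_seconds)
```

The `sleep` and `clock` parameters are injectable only so the loop can be exercised without real waiting; a caller would normally leave them at their defaults and pass something like WAIT_FOR_MASTER_TIMEOUT as `timeout_seconds`.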

jafreck added this to the v0.9.0 milestone Jun 28, 2018

jafreck added the bug label Jun 28, 2018
