This repository has been archived by the owner on Oct 12, 2023. It is now read-only.

Unusable nodes #328

Open
simon-tarr opened this issue Nov 22, 2018 · 1 comment

Comments

@simon-tarr

Hello,

Another instance of Batch pool creation unreliability, I'm afraid. Over the past 48 hours I've been unable to start any low-priority nodes using a Docker image which had been fine for the previous 5 days.

I'm trying to boot 8 x E64s_v3 low-priority nodes. 7 will start successfully and sit idle. One will always have the status "unusable". Every. Single. Time. I must have attempted to boot 50+ pools over the past 48 hours. I have also tried booting a few dedicated nodes (only 2 or 3) and get the same issue. While booting I have also kept an eye on the node status graphs to see if nodes are being pre-empted during the boot process, which could possibly prevent a node from booting successfully. Unfortunately I've seen nothing out of the ordinary to suggest that this is the cause. I have also tried creating same-size and smaller pools using different VM classes (F64s_v2, D64s_v3) with the same result. Note that I am using resource files during pool creation.

Because the node is unusable, there are no files/logs for me to view, so I can't troubleshoot the issue. If I use Batch Explorer to look at what's going on, I can locate the unusable node, but when I click it, it just says: "Node is currently 'unusable', there are no files to view now". I cannot reboot the node because I get a red popup warning (top right of Batch Explorer) which says: "Reboot failed".

As I say, everything was working fine, and now it isn't. Nothing has changed on my end in terms of my pool configuration or the Docker image (arcalis/nichemapr) that I'd been using without issue until 2 days ago.

Thanks,
Simon

@brnleehng
Collaborator

Hi @simon-tarr

I'll check whether there have been any changes on the service side. There could have been a new deployment.

If you are on Batch Explorer, you can upload the Batch node agent logs to your Azure storage container.
The node agent logs will contain useful information about the VM and its status with the Batch service.

Pool > Node > Upload Batch logs to Storage:
Here's an image for uploading your node agent logs:
image
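
If you would rather script this than click through Batch Explorer, the first step is picking out the unusable nodes and deciding how far back the logs should go. The block below is only a minimal sketch of that selection logic. `Node`, `unusable_node_ids`, and `log_window_start` are hypothetical names used for illustration, and the 48-hour default simply mirrors the period mentioned in this report.

```python
from collections import namedtuple
from datetime import datetime, timedelta, timezone

# Hypothetical stand-in for a node record; a real SDK node object only
# needs to expose `id` and `state` attributes for the helper below.
Node = namedtuple("Node", ["id", "state"])


def unusable_node_ids(nodes):
    """Return the ids of nodes whose state is 'unusable'.

    `state` may be a plain string or an enum whose `.value` is a string.
    """
    ids = []
    for node in nodes:
        state = getattr(node.state, "value", node.state)
        if str(state).lower() == "unusable":
            ids.append(node.id)
    return ids


def log_window_start(now=None, hours_back=48):
    """Return the start of the log window, `hours_back` hours before `now` (UTC)."""
    now = now or datetime.now(timezone.utc)
    return now - timedelta(hours=hours_back)


# Example data shaped like the pool in this report: one node never comes up.
sample = [Node("node-1", "idle"), Node("node-2", "unusable"), Node("node-3", "idle")]
print(unusable_node_ids(sample))
```

With the `azure-batch` Python SDK, the node list would typically come from `compute_node.list(pool_id)`, and each selected id could then be passed to `compute_node.upload_batch_service_logs` together with a storage container SAS URL and the start time. Treat those call names as an assumption to verify against the SDK documentation for your installed version.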

If you can share the node agent logs by email (razurebatch@microsoft.com), that would help us diagnose the issue.

Can I get the region, pool name, and time of occurrence?

Thanks!
Brian
