Nomad does not move jobs out of client node with failed docker driver (service stopped) #3776

Closed
Garagoth opened this issue Jan 19, 2018 · 4 comments

Comments

@Garagoth

Nomad version

0.7.1

Operating system and Environment details

CentOS 7, Docker version 17.09.0-ce, build afdb6d4
Running on a 4 node cluster (8 CPU, 64GB RAM each)

Issue

When the Docker daemon is stopped on a client node, Nomad does not try to move tasks to another node.
I do have an attrs.driver.docker = 1 constraint, and Nomad properly recognizes the driver failure (at least it logs this), but it tries to restart the tasks on the same node over and over again.

Either move the tasks and try to start them elsewhere, or maybe add a client health check that will set attrs.driver.docker = 0 so constraints can kick in? (And shouldn't the driver constraint be automatic, since Nomad knows which driver is to be used?)
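
For readers following along, the constraint described above would look roughly like the fragment below. This is a minimal sketch, not the reporter's actual job file: it assumes the standard interpolation form `${attr.driver.docker}` for the fingerprinted driver attribute, and the job name and datacenter are hypothetical placeholders.

```hcl
# Hypothetical job name and datacenter, for illustration only.
job "stolon-sentinel" {
  datacenters = ["dc1"]

  # Place allocations only on clients whose fingerprint reports a Docker driver.
  # The default operator is "=", so this matches attribute value "1".
  constraint {
    attribute = "${attr.driver.docker}"
    value     = "1"
  }
}
```

A constraint like this is evaluated at placement time, so it does not by itself move an allocation that is already running on a node whose driver fails later, which matches the behaviour reported here.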

|   |   |   |   |   |   |
| -- | -- | -- | -- | -- | -- |
| 42 seconds | 0 seconds | Restarting | Restart within policy |   | 0 |
| 25 seconds | 16 seconds | Driver | Downloading image sorintlab/stolon:master-pg9.6 |   | 0 |
| 25 seconds | 0 seconds | Driver Failure | failed to initialize task "sentinel-service" for alloc "cd68c665-049e-70e2-e75b-65aa25372ed0": Failed to pull `sorintlab/stolon:master-pg9.6`: dial unix /var/run/docker.sock: connect: no such file or directory |   | 0 |
| 25 seconds | 0 seconds | Restarting | Exceeded allowed attempts, applying a delay |   | 0 |
| 16 seconds | 9 seconds | Driver | Downloading image sorintlab/stolon:master-pg9.6 |   | 0 |
| 16 seconds | 0 seconds | Driver Failure | failed to initialize task "sentinel-service" for alloc "cd68c665-049e-70e2-e75b-65aa25372ed0": Failed to pull `sorintlab/stolon:master-pg9.6`: dial unix /var/run/docker.sock: connect: no such file or directory |   | 0 |
| 16 seconds | 0 seconds | Restarting | Restart within policy |   | 0 |
| 0 seconds | 15 seconds | Driver | Downloading image sorintlab/stolon:master-pg9.6 |   | 0 |
| 0 seconds | 0 seconds | Driver Failure | failed to initialize task "sentinel-service" for alloc "cd68c665-049e-70e2-e75b-65aa25372ed0": Failed to pull `sorintlab/stolon:master-pg9.6`: dial unix /var/run/docker.sock: connect: no such file or directory |   | 0 |
| 0 seconds | 0 seconds | Restarting | Restart within policy |

When the Docker daemon is started again, these tasks are started. But while Docker is down on one client node, Nomad does not meet the task count (although it could, by using the other nodes).

@preetapan
Member

@Garagoth Thanks for reporting this. We are addressing rescheduling of failed allocations in the upcoming Nomad 0.8 release. Reschedule attempts and time intervals will be made configurable as well.
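
As a rough illustration of the rescheduling configuration mentioned above, the fragment below sketches how a task group might combine local restarts with rescheduling to another node. It assumes the `reschedule` stanza as documented for Nomad 0.8 and later; the group name and all values are hypothetical, and parameter availability should be checked against the documentation for the version in use.

```hcl
# Hypothetical group name and values, for illustration only.
group "sentinel" {
  # Local restarts on the same client. With mode = "fail" the allocation is
  # marked failed once the attempts are exhausted, instead of being retried
  # on the same node after a delay.
  restart {
    attempts = 2
    interval = "1m"
    delay    = "15s"
    mode     = "fail"
  }

  # Once the allocation has failed, try to place a replacement,
  # at most 3 times within 30 minutes.
  reschedule {
    attempts  = 3
    interval  = "30m"
    unlimited = false # assumed necessary for attempts/interval to apply
  }
}
```

The event log in the report shows "Exceeded allowed attempts, applying a delay", which corresponds to the `delay` restart mode; in that mode the allocation keeps retrying locally and never reaches a failed state that rescheduling could act on.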

@chelseakomlo
Contributor

chelseakomlo commented Jan 19, 2018

Hi, thanks for opening this issue. To add to what @preetapan mentioned, we will also add in Nomad 0.8 the concept of ongoing driver health checks, so that if a driver fails, the client will stop advertising this driver until it becomes healthy again.

@preetapan
Member

This was addressed with rescheduling in 0.8.

@github-actions

github-actions bot commented Dec 1, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 1, 2022

4 participants