Container Pilot process get hung and cannot recover when health check timeouts continues for more than an hour #590

Open
kapilraju opened this issue Jan 26, 2021 · 0 comments

kapilraju commented Jan 26, 2021

We recently hit an issue in a container running two Container Pilot jobs: one starts a Spring Boot Java process and the other starts NGINX. Each job has its own health check endpoint, configured as follows:

            health: {
                exec: "/usr/bin/curl --fail -s -o <HEALTH CHECK ENDPOINTS>",
                interval: 10,
                ttl: 25,
                timeout: "30s"
            },

The design is as follows: the container starts with port 443 mapped; inside the container, NGINX listens on 443 and forwards requests to the Spring Boot Java process.

During a database outage, a badly written Spring Boot health check endpoint returned no response within the timeout and showed high latency, so Container Pilot logged "timeout after 30s" for the Spring Boot health check.

The puzzling part is that if this situation continues (i.e. Spring Boot has not recovered) for around 1 hour 7 minutes (this is consistent behaviour with Container Pilot), Container Pilot starts logging "timeout after 30s" for the NGINX job as well. NGINX has nothing to do with the database, and its health check endpoint doesn't talk to any other process.

At this point, if you log in to the container and curl both endpoints, the NGINX health check returns fine and the Spring Boot health check also returns fine (in our case it returned after 30 seconds due to the underlying database issue).
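For anyone who wants to repeat that manual check with timings, here is a minimal sketch in Python. It is not what we ran (we used curl), and it assumes Python 3 is available in the container. The `probe` helper and the example URLs are hypothetical; substitute your own health check endpoints.

```python
import time
import urllib.error
import urllib.request
from typing import Tuple


def probe(url: str, timeout_seconds: float) -> Tuple[str, float]:
    """Issue one GET against url and return (outcome, elapsed_seconds)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_seconds) as response:
            outcome = f"HTTP {response.status}"
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (curl --fail would exit non-zero).
        outcome = f"HTTP {err.code}"
    except (urllib.error.URLError, OSError) as err:
        # Connection refused, DNS failure, or no response within timeout_seconds.
        outcome = f"failed: {err}"
    return outcome, time.monotonic() - start


# Example use (hypothetical URLs, not taken from our setup):
#   for url in ("http://localhost:8080/health", "http://localhost:8081/health"):
#       print(url, probe(url, timeout_seconds=60))
```

Comparing the elapsed time for each endpoint against the configured `timeout` shows whether the endpoint itself is slow or whether only Container Pilot's view of it is stuck.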

From this point onwards, even after the database is back to normal and Spring Boot is healthy, Container Pilot stays in this hung state and cannot recover without a restart, which means the container will never be registered with Consul even after it is healthy.

Steps to reproduce -

  • Create two Container Pilot jobs, one to start a Java process and another to start an NGINX process
  • Implement a health check endpoint with a 40-second wait in it
  • Use timeout: "30s" in your Container Pilot config
  • Wait for 1 hour 7 minutes
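If you don't want to build a Spring Boot app just to reproduce this, the slow health check from the steps above can be approximated with a small stand-in server. This is only a sketch under assumptions: it is Python rather than our actual Java service, and the `/health` path, the port, and the `make_slow_health_server` name are all hypothetical.

```python
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def make_slow_health_server(delay_seconds: float, port: int = 0) -> ThreadingHTTPServer:
    """Build an HTTP server whose /health handler sleeps before answering 200."""

    class SlowHealthHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/health":
                self.send_error(404)
                return
            # Simulate the slow endpoint: wait longer than the health check timeout.
            time.sleep(delay_seconds)
            body = b"OK\n"
            try:
                self.send_response(200)
                self.send_header("Content-Type", "text/plain")
                self.send_header("Content-Length", str(len(body)))
                self.end_headers()
                self.wfile.write(body)
            except (BrokenPipeError, ConnectionResetError):
                # The client (e.g. a timed-out health check) already went away.
                pass

        def log_message(self, format, *args):
            pass  # keep the console quiet during long runs

    # port=0 lets the OS pick a free port; pass a fixed port for a real run.
    return ThreadingHTTPServer(("127.0.0.1", port), SlowHealthHandler)


# To use it for the reproduction (blocks forever; port 8080 is an assumption):
#   make_slow_health_server(delay_seconds=40, port=8080).serve_forever()
```

Point the Java job's health `exec` at this server instead of the real endpoint; with a 40-second delay and `timeout: "30s"`, every check should time out the same way ours did.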
@kapilraju kapilraju changed the title Containerpilot process get hung and cannot recover when health check timeouts continues for hours Container Pilot process get hung and cannot recover when health check timeouts continues for more than an hour Jan 26, 2021