[BUG]: docker containers don't crash #188

Open
haf opened this issue Feb 10, 2023 · 11 comments
@haf

haf commented Feb 10, 2023

Describe the bug
If the database or Redis is unavailable, the Docker containers don't crash. This prevents them from auto-healing (e.g. DNS recovering and the env var being injected).

To Reproduce
Steps to reproduce the behavior:

  1. Deploy e.g. the api-worker and the database, but leave out the Redis URI
  2. The api-worker will now crash-loop internally and log a lot of output (= $$$) without actually crashing the Kubernetes pod.
api-worker-79b46fdf88-7nrnw api-worker W, [2023-02-10T13:33:44.322529 #7]  WARN -- : /usr/local/bundle/gems/redis-4.7.1/lib/redis/client.rb:398:in `rescue in establish_connection'
api-worker-79b46fdf88-7nrnw api-worker /usr/local/bundle/gems/redis-4.7.1/lib/redis/client.rb:379:in `establish_connection'
api-worker-79b46fdf88-7nrnw api-worker /usr/local/bundle/gems/redis-4.7.1/lib/redis/client.rb:115:in `block in connect'
api-worker-79b46fdf88-7nrnw api-worker /usr/local/bundle/gems/redis-4.7.1/lib/redis/client.rb:344:in `with_reconnect'
api-worker-79b46fdf88-7nrnw api-worker /usr/local/bundle/gems/redis-4.7.1/lib/redis/client.rb:114:in `connect'
api-worker-79b46fdf88-7nrnw api-worker /usr/local/bundle/gems/newrelic_rpm-8.15.0/lib/new_relic/agent/instrumentation/redis/prepend.rb:25:in `block in connect'
api-worker-79b46fdf88-7nrnw api-worker /usr/local/bundle/gems/newrelic_rpm-8.15.0/lib/new_relic/agent/instrumentation/redis/instrumentation.rb:10:in `block in connect_with_tracing'
api-worker-79b46fdf88-7nrnw api-worker /usr/local/bundle/gems/newrelic_rpm-8.15.0/lib/new_relic/agent/instrumentation/redis/instrumentation.rb:55:in `block in with_tracing'
api-worker-79b46fdf88-7nrnw api-worker /usr/local/bundle/gems/newrelic_rpm-8.15.0/lib/new_relic/agent/tracer.rb:356:in `capture_segment_error'
api-worker-79b46fdf88-7nrnw api-worker /usr/local/bundle/gems/newrelic_rpm-8.15.0/lib/new_relic/agent/instrumentation/redis/instrumentation.rb:55:in `with_tracing'
api-worker-79b46fdf88-7nrnw api-worker /usr/local/bundle/gems/newrelic_rpm-8.15.0/lib/new_relic/agent/instrumentation/redis/instrumentation.rb:10:in `connect_with_tracing'
api-worker-79b46fdf88-7nrnw api-worker /usr/local/bundle/gems/newrelic_rpm-8.15.0/lib/new_relic/agent/instrumentation/redis/prepend.rb:25:in `connect'
api-worker-79b46fdf88-7nrnw api-worker /usr/local/bundle/gems/redis-4.7.1/lib/redis/client.rb:417:in `ensure_connected'
api-worker-79b46fdf88-7nrnw api-worker /usr/local/bundle/gems/redis-4.7.1/lib/redis/client.rb:269:in `block in process'
api-worker-79b46fdf88-7nrnw api-worker /usr/local/bundle/gems/redis-4.7.1/lib/redis/client.rb:356:in `logging'
api-worker-79b46fdf88-7nrnw api-worker /usr/local/bundle/gems/sentry-ruby-core-5.3.1/lib/sentry/redis.rb:78:in `block in logging'
api-worker-79b46fdf88-7nrnw api-worker /usr/local/bundle/gems/sentry-ruby-core-5.3.1/lib/sentry/redis.rb:17:in `block in instrument'
api-worker-79b46fdf88-7nrnw api-worker /usr/local/bundle/gems/sentry-ruby-core-5.3.1/lib/sentry/redis.rb:28:in `record_span'
api-worker-79b46fdf88-7nrnw api-worker /usr/local/bundle/gems/sentry-ruby-core-5.3.1/lib/sentry/redis.rb:16:in `instrument'
api-worker-79b46fdf88-7nrnw api-worker /usr/local/bundle/gems/sentry-ruby-core-5.3.1/lib/sentry/redis.rb:77:in `logging'
api-worker-79b46fdf88-7nrnw api-worker /usr/local/bundle/gems/redis-4.7.1/lib/redis/client.rb:268:in `process'
api-worker-79b46fdf88-7nrnw api-worker /usr/local/bundle/gems/redis-4.7.1/lib/redis/client.rb:161:in `call'
api-worker-79b46fdf88-7nrnw api-worker /usr/local/bundle/gems/newrelic_rpm-8.15.0/lib/new_relic/agent/instrumentation/redis/prepend.rb:17:in `block in call'
api-worker-79b46fdf88-7nrnw api-worker /usr/local/bundle/gems/newrelic_rpm-8.15.0/lib/new_relic/agent/instrumentation/redis/instrumentation.rb:17:in `block in call_with_tracing'
api-worker-79b46fdf88-7nrnw api-worker /usr/local/bundle/gems/newrelic_rpm-8.15.0/lib/new_relic/agent/instrumentation/redis/instrumentation.rb:55:in `block in with_tracing'
api-worker-79b46fdf88-7nrnw api-worker /usr/local/bundle/gems/newrelic_rpm-8.15.0/lib/new_relic/agent/tracer.rb:356:in `capture_segment_error'
api-worker-79b46fdf88-7nrnw api-worker /usr/local/bundle/gems/newrelic_rpm-8.15.0/lib/new_relic/agent/instrumentation/redis/instrumentation.rb:55:in `with_tracing'
api-worker-79b46fdf88-7nrnw api-worker /usr/local/bundle/gems/newrelic_rpm-8.15.0/lib/new_relic/agent/instrumentation/redis/instrumentation.rb:17:in `call_with_tracing'
api-worker-79b46fdf88-7nrnw api-worker /usr/local/bundle/gems/newrelic_rpm-8.15.0/lib/new_relic/agent/instrumentation/redis/prepend.rb:17:in `call'
api-worker-79b46fdf88-7nrnw api-worker /usr/local/bundle/gems/redis-4.7.1/lib/redis.rb:269:in `block in send_command'
api-worker-79b46fdf88-7nrnw api-worker /usr/local/bundle/gems/redis-4.7.1/lib/redis.rb:268:in `synchronize'
api-worker-79b46fdf88-7nrnw api-worker /usr/local/bundle/gems/redis-4.7.1/lib/redis.rb:268:in `send_command'
api-worker-79b46fdf88-7nrnw api-worker /usr/local/bundle/gems/redis-4.7.1/lib/redis/commands/sets.rb:11:in `scard'
api-worker-79b46fdf88-7nrnw api-worker /usr/local/bundle/gems/sidekiq-6.5.1/lib/sidekiq/api.rb:867:in `block in size'
api-worker-79b46fdf88-7nrnw api-worker /usr/local/bundle/gems/sidekiq-6.5.1/lib/sidekiq.rb:156:in `block in redis'
api-worker-79b46fdf88-7nrnw api-worker /usr/local/bundle/gems/connection_pool-2.3.0/lib/connection_pool.rb:65:in `block (2 levels) in with'
api-worker-79b46fdf88-7nrnw api-worker /usr/local/bundle/gems/connection_pool-2.3.0/lib/connection_pool.rb:64:in `handle_interrupt'
api-worker-79b46fdf88-7nrnw api-worker /usr/local/bundle/gems/connection_pool-2.3.0/lib/connection_pool.rb:64:in `block in with'
api-worker-79b46fdf88-7nrnw api-worker /usr/local/bundle/gems/connection_pool-2.3.0/lib/connection_pool.rb:61:in `handle_interrupt'
api-worker-79b46fdf88-7nrnw api-worker /usr/local/bundle/gems/connection_pool-2.3.0/lib/connection_pool.rb:61:in `with'
api-worker-79b46fdf88-7nrnw api-worker /usr/local/bundle/gems/sidekiq-6.5.1/lib/sidekiq.rb:153:in `redis'
api-worker-79b46fdf88-7nrnw api-worker /usr/local/bundle/gems/sidekiq-6.5.1/lib/sidekiq/api.rb:867:in `size'
api-worker-79b46fdf88-7nrnw api-worker /usr/local/bundle/gems/sidekiq-6.5.1/lib/sidekiq/scheduled.rb:190:in `process_count'
api-worker-79b46fdf88-7nrnw api-worker /usr/local/bundle/gems/sidekiq-6.5.1/lib/sidekiq/scheduled.rb:151:in `random_poll_interval'
api-worker-79b46fdf88-7nrnw api-worker /usr/local/bundle/gems/sidekiq-6.5.1/lib/sidekiq/scheduled.rb:120:in `wait'
api-worker-79b46fdf88-7nrnw api-worker /usr/local/bundle/gems/sidekiq-6.5.1/lib/sidekiq/scheduled.rb:102:in `block in start'
api-worker-79b46fdf88-7nrnw api-worker /usr/local/bundle/gems/sidekiq-6.5.1/lib/sidekiq/component.rb:8:in `watchdog'
api-worker-79b46fdf88-7nrnw api-worker /usr/local/bundle/gems/sidekiq-6.5.1/lib/sidekiq/component.rb:17:in `block in safe_thread'

Expected behavior
If the environment a service runs in is misconfigured, print an error message and exit the process. That lets monitoring software alert on the problem. Otherwise ops has to write app-specific log listeners, turn the logs into events (including writing Ruby stacktrace parsers), deploy those listeners, and then build API integrations with the platform runtime (like Kubernetes) to restart the pod.
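
As an illustration only, here is a minimal fail-fast sketch for a Rails app like this one. The initializer path, the REDIS_URL env var, and the exception handling are assumptions, not Lago's actual code; the point is simply to ping the backing services at boot and exit non-zero so the orchestrator can restart the container:

  # config/initializers/00_fail_fast.rb (hypothetical sketch, not Lago's code)
  # Crash at boot if required backing services are unreachable, instead of
  # letting the container keep running in a broken state.
  require "redis"

  begin
    Redis.new(url: ENV.fetch("REDIS_URL")).ping         # assumes REDIS_URL configures Redis
    ActiveRecord::Base.connection.execute("SELECT 1")   # cheap database connectivity check
  rescue StandardError => e                             # e.g. Redis::CannotConnectError, ActiveRecord::ConnectionNotEstablished, KeyError
    warn "FATAL: backing service unavailable at boot: #{e.class}: #{e.message}"
    exit 1
  end

With something like this in place, a misconfigured or unreachable Redis makes the pod crash-loop visibly, which monitoring and the platform's restart policies can act on.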

Support

  • Version: getlago/api:v0.22.0-beta
@vincent-pochet
Contributor

Hi @haf - thank you for your message!

It's definitely something that should be improved. We will investigate and get back to you with a fix as soon as possible.

@jdenquin jdenquin self-assigned this Feb 13, 2023
@jdenquin jdenquin added the Infrastructure Related to Infrastructure label Feb 13, 2023
@OscarKolsrud

Is there any status on this? I seem to be running into something related where the API suddenly starts returning 500s, and a compose down/up fixes it.

@jdenquin
Contributor

@OscarKolsrud we have to dig into it a bit, since this is the way Rails/Sidekiq works today; we may have to make it customizable.
On our side, in production, we prefer to see errors rather than pods constantly restarting, but that's a Rubyist habit, I think 😂

@gabrielseibel1

Hi, my org has also hit this problem. The lago clock was frozen in an error state for 2 days after a Redis disconnect: it simply stopped logging after the error, even though the pod kept running, and customers weren't charged. We suspect this is the second or third time we've lost a billing day to this kind of outage (pods stuck in an error state on the last/first days of the month). Do you have updates on this, or any suggestion for a mechanism to restart pods once they enter an error state?

@haf
Author

haf commented Feb 13, 2024

https://en.wikipedia.org/wiki/Crash-only_software - this is a reference to a very sane way of building software, where faults are auto-corrected by restarting at different levels.

@jdenquin
Contributor

@gabrielseibel1 if Redis is your problem, restarting the pod will not fix the issue; the worker will recover on its own once Redis is available again.
What errors do you see on the worker when it's in that state?

@haf
Author

haf commented Feb 13, 2024

If connectivity to Redis is the problem, a restart will trigger a retry of the connection.

@jdenquin
Contributor

Each time a job is enqueued or wants to run, the connection to Redis is retried.
Is the health check enough for you? If the database or Redis is out, the health endpoint will return an error.
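
To make that concrete, here is a minimal sketch of what such a health endpoint could look like in a Rails app, suitable as a Kubernetes liveness probe target. The controller name, route, and REDIS_URL env var are assumptions for illustration, not Lago's actual implementation:

  # app/controllers/health_controller.rb (hypothetical sketch)
  # Returns 200 only when the database and Redis respond; a Kubernetes
  # liveness probe pointed at this action restarts the pod otherwise.
  class HealthController < ApplicationController
    def show
      ActiveRecord::Base.connection.execute("SELECT 1")   # database connectivity
      Redis.new(url: ENV.fetch("REDIS_URL")).ping         # redis connectivity
      head :ok
    rescue StandardError => e
      render plain: "unhealthy: #{e.class}: #{e.message}", status: :service_unavailable
    end
  end

A probe hitting GET /health (however the app routes it) would then fail while Redis or the database is unreachable, and Kubernetes would restart the container once the failure threshold is reached.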

@haf
Author

haf commented Feb 13, 2024

If it's wired up as a liveness probe in k8s, that would solve my problem. If you've fixed the bug from the stacktrace in the original post of this thread, you can close the issue. Do note, though, that this stacktrace happens on container start, not after a while or due to networking errors, so in the case of this issue you'd never have a successful health check.

@jdenquin
Contributor

I'm currently working on the liveness probe in our Helm chart, so this is definitely something we'll release very soon!

@doctorpangloss

It's been a year. The clock crashes often. How do you guys deal with this?
