[BUG]: docker containers don't crash #188

Open
haf opened this issue Feb 10, 2023 · 11 comments
@haf

haf commented Feb 10, 2023

Describe the bug
If the database or Redis is unavailable, the Docker containers don't crash. This prevents them from auto-healing (e.g. DNS recovering and the env var being injected).

To Reproduce
Steps to reproduce the behavior:

  1. Deploy e.g. the api-worker and the database, but leave out the Redis URI
  2. The api-worker will now crash-loop internally and log a lot of output (= $$$) without actually crashing the Kubernetes pod.
api-worker-79b46fdf88-7nrnw api-worker W, [2023-02-10T13:33:44.322529 #7]  WARN -- : /usr/local/bundle/gems/redis-4.7.1/lib/redis/client.rb:398:in `rescue in establish_connection'
api-worker-79b46fdf88-7nrnw api-worker /usr/local/bundle/gems/redis-4.7.1/lib/redis/client.rb:379:in `establish_connection'
api-worker-79b46fdf88-7nrnw api-worker /usr/local/bundle/gems/redis-4.7.1/lib/redis/client.rb:115:in `block in connect'
api-worker-79b46fdf88-7nrnw api-worker /usr/local/bundle/gems/redis-4.7.1/lib/redis/client.rb:344:in `with_reconnect'
api-worker-79b46fdf88-7nrnw api-worker /usr/local/bundle/gems/redis-4.7.1/lib/redis/client.rb:114:in `connect'
api-worker-79b46fdf88-7nrnw api-worker /usr/local/bundle/gems/newrelic_rpm-8.15.0/lib/new_relic/agent/instrumentation/redis/prepend.rb:25:in `block in connect'
api-worker-79b46fdf88-7nrnw api-worker /usr/local/bundle/gems/newrelic_rpm-8.15.0/lib/new_relic/agent/instrumentation/redis/instrumentation.rb:10:in `block in connect_with_tracing'
api-worker-79b46fdf88-7nrnw api-worker /usr/local/bundle/gems/newrelic_rpm-8.15.0/lib/new_relic/agent/instrumentation/redis/instrumentation.rb:55:in `block in with_tracing'
api-worker-79b46fdf88-7nrnw api-worker /usr/local/bundle/gems/newrelic_rpm-8.15.0/lib/new_relic/agent/tracer.rb:356:in `capture_segment_error'
api-worker-79b46fdf88-7nrnw api-worker /usr/local/bundle/gems/newrelic_rpm-8.15.0/lib/new_relic/agent/instrumentation/redis/instrumentation.rb:55:in `with_tracing'
api-worker-79b46fdf88-7nrnw api-worker /usr/local/bundle/gems/newrelic_rpm-8.15.0/lib/new_relic/agent/instrumentation/redis/instrumentation.rb:10:in `connect_with_tracing'
api-worker-79b46fdf88-7nrnw api-worker /usr/local/bundle/gems/newrelic_rpm-8.15.0/lib/new_relic/agent/instrumentation/redis/prepend.rb:25:in `connect'
api-worker-79b46fdf88-7nrnw api-worker /usr/local/bundle/gems/redis-4.7.1/lib/redis/client.rb:417:in `ensure_connected'
api-worker-79b46fdf88-7nrnw api-worker /usr/local/bundle/gems/redis-4.7.1/lib/redis/client.rb:269:in `block in process'
api-worker-79b46fdf88-7nrnw api-worker /usr/local/bundle/gems/redis-4.7.1/lib/redis/client.rb:356:in `logging'
api-worker-79b46fdf88-7nrnw api-worker /usr/local/bundle/gems/sentry-ruby-core-5.3.1/lib/sentry/redis.rb:78:in `block in logging'
api-worker-79b46fdf88-7nrnw api-worker /usr/local/bundle/gems/sentry-ruby-core-5.3.1/lib/sentry/redis.rb:17:in `block in instrument'
api-worker-79b46fdf88-7nrnw api-worker /usr/local/bundle/gems/sentry-ruby-core-5.3.1/lib/sentry/redis.rb:28:in `record_span'
api-worker-79b46fdf88-7nrnw api-worker /usr/local/bundle/gems/sentry-ruby-core-5.3.1/lib/sentry/redis.rb:16:in `instrument'
api-worker-79b46fdf88-7nrnw api-worker /usr/local/bundle/gems/sentry-ruby-core-5.3.1/lib/sentry/redis.rb:77:in `logging'
api-worker-79b46fdf88-7nrnw api-worker /usr/local/bundle/gems/redis-4.7.1/lib/redis/client.rb:268:in `process'
api-worker-79b46fdf88-7nrnw api-worker /usr/local/bundle/gems/redis-4.7.1/lib/redis/client.rb:161:in `call'
api-worker-79b46fdf88-7nrnw api-worker /usr/local/bundle/gems/newrelic_rpm-8.15.0/lib/new_relic/agent/instrumentation/redis/prepend.rb:17:in `block in call'
api-worker-79b46fdf88-7nrnw api-worker /usr/local/bundle/gems/newrelic_rpm-8.15.0/lib/new_relic/agent/instrumentation/redis/instrumentation.rb:17:in `block in call_with_tracing'
api-worker-79b46fdf88-7nrnw api-worker /usr/local/bundle/gems/newrelic_rpm-8.15.0/lib/new_relic/agent/instrumentation/redis/instrumentation.rb:55:in `block in with_tracing'
api-worker-79b46fdf88-7nrnw api-worker /usr/local/bundle/gems/newrelic_rpm-8.15.0/lib/new_relic/agent/tracer.rb:356:in `capture_segment_error'
api-worker-79b46fdf88-7nrnw api-worker /usr/local/bundle/gems/newrelic_rpm-8.15.0/lib/new_relic/agent/instrumentation/redis/instrumentation.rb:55:in `with_tracing'
api-worker-79b46fdf88-7nrnw api-worker /usr/local/bundle/gems/newrelic_rpm-8.15.0/lib/new_relic/agent/instrumentation/redis/instrumentation.rb:17:in `call_with_tracing'
api-worker-79b46fdf88-7nrnw api-worker /usr/local/bundle/gems/newrelic_rpm-8.15.0/lib/new_relic/agent/instrumentation/redis/prepend.rb:17:in `call'
api-worker-79b46fdf88-7nrnw api-worker /usr/local/bundle/gems/redis-4.7.1/lib/redis.rb:269:in `block in send_command'
api-worker-79b46fdf88-7nrnw api-worker /usr/local/bundle/gems/redis-4.7.1/lib/redis.rb:268:in `synchronize'
api-worker-79b46fdf88-7nrnw api-worker /usr/local/bundle/gems/redis-4.7.1/lib/redis.rb:268:in `send_command'
api-worker-79b46fdf88-7nrnw api-worker /usr/local/bundle/gems/redis-4.7.1/lib/redis/commands/sets.rb:11:in `scard'
api-worker-79b46fdf88-7nrnw api-worker /usr/local/bundle/gems/sidekiq-6.5.1/lib/sidekiq/api.rb:867:in `block in size'
api-worker-79b46fdf88-7nrnw api-worker /usr/local/bundle/gems/sidekiq-6.5.1/lib/sidekiq.rb:156:in `block in redis'
api-worker-79b46fdf88-7nrnw api-worker /usr/local/bundle/gems/connection_pool-2.3.0/lib/connection_pool.rb:65:in `block (2 levels) in with'
api-worker-79b46fdf88-7nrnw api-worker /usr/local/bundle/gems/connection_pool-2.3.0/lib/connection_pool.rb:64:in `handle_interrupt'
api-worker-79b46fdf88-7nrnw api-worker /usr/local/bundle/gems/connection_pool-2.3.0/lib/connection_pool.rb:64:in `block in with'
api-worker-79b46fdf88-7nrnw api-worker /usr/local/bundle/gems/connection_pool-2.3.0/lib/connection_pool.rb:61:in `handle_interrupt'
api-worker-79b46fdf88-7nrnw api-worker /usr/local/bundle/gems/connection_pool-2.3.0/lib/connection_pool.rb:61:in `with'
api-worker-79b46fdf88-7nrnw api-worker /usr/local/bundle/gems/sidekiq-6.5.1/lib/sidekiq.rb:153:in `redis'
api-worker-79b46fdf88-7nrnw api-worker /usr/local/bundle/gems/sidekiq-6.5.1/lib/sidekiq/api.rb:867:in `size'
api-worker-79b46fdf88-7nrnw api-worker /usr/local/bundle/gems/sidekiq-6.5.1/lib/sidekiq/scheduled.rb:190:in `process_count'
api-worker-79b46fdf88-7nrnw api-worker /usr/local/bundle/gems/sidekiq-6.5.1/lib/sidekiq/scheduled.rb:151:in `random_poll_interval'
api-worker-79b46fdf88-7nrnw api-worker /usr/local/bundle/gems/sidekiq-6.5.1/lib/sidekiq/scheduled.rb:120:in `wait'
api-worker-79b46fdf88-7nrnw api-worker /usr/local/bundle/gems/sidekiq-6.5.1/lib/sidekiq/scheduled.rb:102:in `block in start'
api-worker-79b46fdf88-7nrnw api-worker /usr/local/bundle/gems/sidekiq-6.5.1/lib/sidekiq/component.rb:8:in `watchdog'
api-worker-79b46fdf88-7nrnw api-worker /usr/local/bundle/gems/sidekiq-6.5.1/lib/sidekiq/component.rb:17:in `block in safe_thread'

Expected behavior
If the environment a service runs in is misconfigured, print an error message and exit the process. That lets monitoring software alert on the problem. Otherwise ops has to write app-specific log listeners, turn the logs into events (including writing Ruby stacktrace parsers), deploy those listeners, and then build API integrations with the platform runtime (like Kubernetes) to restart the pod.
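
As an illustration only, here is a minimal fail-fast sketch for a Rails app like this one. The initializer path, the REDIS_URL env var, and the exception handling are assumptions, not Lago's actual code; the point is simply to ping the backing services at boot and exit non-zero so the orchestrator can restart the container:

  # config/initializers/00_fail_fast.rb (hypothetical sketch, not Lago's code)
  # Crash at boot if required backing services are unreachable, instead of
  # letting the container keep running in a broken state.
  require "redis"

  begin
    Redis.new(url: ENV.fetch("REDIS_URL")).ping         # assumes REDIS_URL configures Redis
    ActiveRecord::Base.connection.execute("SELECT 1")   # cheap database connectivity check
  rescue StandardError => e                             # e.g. Redis::CannotConnectError, ActiveRecord::ConnectionNotEstablished, KeyError
    warn "FATAL: backing service unavailable at boot: #{e.class}: #{e.message}"
    exit 1
  end

With something like this in place, a misconfigured or unreachable Redis makes the pod crash-loop visibly, which monitoring and the platform's restart policies can act on.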

Support

  • Version: getlago/api:v0.22.0-beta
@vincent-pochet
Contributor

Hi @haf - thank you for your message!

It's definitely something that should be improved. We will investigate and get back to you with a fix as soon as possible.

@jdenquin jdenquin self-assigned this Feb 13, 2023
@jdenquin jdenquin added the Infrastructure Related to Infrastructure label Feb 13, 2023
@OscarKolsrud

Is there any status on this? I seem to be running into something related where the API suddenly starts returning 500s, and a compose down/up fixes it.

@jdenquin
Contributor

@OscarKolsrud we have to dig into it a bit, since this is the way Rails/Sidekiq works today; we may have to make it customizable.
On our side, in production, we prefer to see errors rather than pods constantly restarting, but that's a Rubyist habit, I think 😂

@gabrielseibel1

Hi, my org has also hit this problem. The lago clock was frozen in an error state for 2 days after a Redis disconnect: it simply stopped logging after the error, even though the pod kept running, and customers weren't charged. We suspect this is the second or third time we've lost a billing day to this kind of outage (pods stuck in an error state on the last/first days of the month). Do you have updates on this, or any suggestion for a mechanism to restart pods once they enter an error state?

@haf
Author

haf commented Feb 13, 2024

https://en.wikipedia.org/wiki/Crash-only_software - this is a reference to a very sane way of building software, where faults are auto-corrected by restarting at different levels.

@jdenquin
Contributor

@gabrielseibel1 if Redis is your problem, restarting the pod will not fix the issue; the worker will recover on its own once Redis is available again.
What errors do you see on the worker when it's in that state?

@haf
Author

haf commented Feb 13, 2024

If connectivity to Redis is the problem, a restart will trigger a retry of the connection.

@jdenquin
Contributor

Each time a job is enqueued or wants to run, the connection to Redis is retried.
Is the health check enough for you? If the database or Redis is out, the health endpoint will return an error.
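
To make that concrete, here is a minimal sketch of what such a health endpoint could look like in a Rails app, suitable as a Kubernetes liveness probe target. The controller name, route, and REDIS_URL env var are assumptions for illustration, not Lago's actual implementation:

  # app/controllers/health_controller.rb (hypothetical sketch)
  # Returns 200 only when the database and Redis respond; a Kubernetes
  # liveness probe pointed at this action restarts the pod otherwise.
  class HealthController < ApplicationController
    def show
      ActiveRecord::Base.connection.execute("SELECT 1")   # database connectivity
      Redis.new(url: ENV.fetch("REDIS_URL")).ping         # redis connectivity
      head :ok
    rescue StandardError => e
      render plain: "unhealthy: #{e.class}: #{e.message}", status: :service_unavailable
    end
  end

A probe hitting GET /health (however the app routes it) would then fail while Redis or the database is unreachable, and Kubernetes would restart the container once the failure threshold is reached.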

@haf
Author

haf commented Feb 13, 2024

If it's wired up as a liveness probe in k8s, that would solve my problem. If you've fixed the bug from the stacktrace in the original post of this thread, you can close the issue. Do note, though, that this stacktrace happens on container start, not after a while or due to networking errors, so in the case of this issue you'd never have a successful health check.

@jdenquin
Contributor

I'm currently working on the liveness probe in our Helm chart, so this is definitely something we'll release very soon!

@doctorpangloss

It's been a year. The clock crashes often. How do you guys deal with this?
