[feature] Improve health check by checking redis connection #766

Aohzan · 2024-01-24T10:05:31Z

Hello,

We had an issue on our platform, we perform a HAProxy check on /health on centrifugo servers, which was still ok while centrifugo raised errors because Redis connection issued READONLY You can't write against a read only replica errors.
centrifugo didn't change its redis server when sentinel announced the new primary, we had the issue on 1 of 4 servers.

Describe the solution you'd like
It would be great if the health check will return an error if there is any redis connection error

Thank you

The text was updated successfully, but these errors were encountered:

FZambia · 2024-01-24T10:35:27Z

Hello @Aohzan

What is your Centrifugo configuration for Redis Sentinel? Do you use redis_sentinel_address? Which version of Centrifugo you have? Which Redis version? Anything specific in your setup? Asking to also understand why Centrifugo could miss master change.

Aohzan · 2024-01-24T10:42:03Z

centrifugo 4.1.2-0
redis-sentinel 6:7.2.4-1rl1~bullseye1 (upgraded from 5:6.0.16-1+deb11u2 before the sentinel failover)

redis configuration part in centrifugo

  "engine": "redis",                     
  "redis_sentinel_address": "localhost:26379",
  "redis_sentinel_master_name": "mymaster",

there is:

4 centrifugo servers with centrifugo + sentinel
3 redis servers with redis + sentinel

so 7 sentinel, the failover of the master has been handled correctly on all sentinel, but just 1 of 4 centrifugo didn't change its redis server

FZambia · 2024-01-24T11:22:41Z

Thanks for the details. In your case, I'd concentrate on trying to fix the root cause - I suppose first step here is upgrading to the latest Centrifugo version and trying to reproduce with it – there were many improvements in the underlying Redis library since v4.1.2.

I think adding Redis connection check in health endpoint may not fully solve the problem. It may hide the problem, until no Centrifugo nodes left. I suppose you have just a Haproxy without Kubernetes? In this case Haproxy will remove failed node from the balancing, but the issue on Node will persist until someone restarts Centrifugo node. Is it right? Or it works differently? Probably there is some automation to restart failing node?

In Kubernetes liveness probe failure results into app restart – in that case Redis connection check could make more sense.

Aohzan · 2024-01-24T12:58:22Z

Yes, we will see to update to the latest version.

Yes, no k8s, what we want it's that the HAProxy will exclude the problematic Centrifugo from the pool as soon as it can't handle request properly due to Redis issue. So it let me the time to debug and restart Centrifugo.

Aohzan · 2024-02-03T16:14:26Z

Hello
We planned the upgrade, as we had the same issue on all nodes after a redis failover.
For the original request, do you have an opinion?

FZambia · 2024-02-03T17:36:50Z

We planned the upgrade, as we had the same issue on all nodes after a redis failover.

Probably you can experiment with new version first regarding Sentinel failover to make sure it solves the problem. I mean without upgrading rest of the app. While upgrade is recommended anyway, this may help to iterate on issue faster.

For the original request, do you have an opinion?

I think optional Redis check could make sense in general, though prefer it being disabled by default – when dealing with many connections removing node from balancing may be not better that waiting for connection issues to go away. Looks like Centrifugo needs to issue some write request to Redis to make sure connection is working and writable.

UPD. Though not sure what to do with Redis Cluster case, where many Redis shards must be checked. In this perspective I'd invest to proper failover as I said before

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[feature] Improve health check by checking redis connection #766

[feature] Improve health check by checking redis connection #766

Aohzan commented Jan 24, 2024

FZambia commented Jan 24, 2024 •

edited

Aohzan commented Jan 24, 2024 •

edited

FZambia commented Jan 24, 2024

Aohzan commented Jan 24, 2024

Aohzan commented Feb 3, 2024

FZambia commented Feb 3, 2024 •

edited

[feature] Improve health check by checking redis connection #766

[feature] Improve health check by checking redis connection #766

Comments

Aohzan commented Jan 24, 2024

FZambia commented Jan 24, 2024 • edited

Aohzan commented Jan 24, 2024 • edited

FZambia commented Jan 24, 2024

Aohzan commented Jan 24, 2024

Aohzan commented Feb 3, 2024

FZambia commented Feb 3, 2024 • edited

FZambia commented Jan 24, 2024 •

edited

Aohzan commented Jan 24, 2024 •

edited

FZambia commented Feb 3, 2024 •

edited