Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[feature] Improve health check by checking redis connection #766

Open
Aohzan opened this issue Jan 24, 2024 · 6 comments
Open

[feature] Improve health check by checking redis connection #766

Aohzan opened this issue Jan 24, 2024 · 6 comments

Comments

@Aohzan
Copy link

Aohzan commented Jan 24, 2024

Hello,

We had an issue on our platform, we perform a HAProxy check on /health on centrifugo servers, which was still ok while centrifugo raised errors because Redis connection issued READONLY You can't write against a read only replica errors.
centrifugo didn't change its redis server when sentinel announced the new primary, we had the issue on 1 of 4 servers.

Describe the solution you'd like
It would be great if the health check will return an error if there is any redis connection error

Thank you

@FZambia
Copy link
Member

FZambia commented Jan 24, 2024

Hello @Aohzan

What is your Centrifugo configuration for Redis Sentinel? Do you use redis_sentinel_address? Which version of Centrifugo you have? Which Redis version? Anything specific in your setup? Asking to also understand why Centrifugo could miss master change.

@Aohzan
Copy link
Author

Aohzan commented Jan 24, 2024

centrifugo 4.1.2-0
redis-sentinel 6:7.2.4-1rl1~bullseye1 (upgraded from 5:6.0.16-1+deb11u2 before the sentinel failover)

redis configuration part in centrifugo

  "engine": "redis",                     
  "redis_sentinel_address": "localhost:26379",
  "redis_sentinel_master_name": "mymaster",

there is:

  • 4 centrifugo servers with centrifugo + sentinel
  • 3 redis servers with redis + sentinel

so 7 sentinel, the failover of the master has been handled correctly on all sentinel, but just 1 of 4 centrifugo didn't change its redis server

@FZambia
Copy link
Member

FZambia commented Jan 24, 2024

Thanks for the details. In your case, I'd concentrate on trying to fix the root cause - I suppose first step here is upgrading to the latest Centrifugo version and trying to reproduce with it – there were many improvements in the underlying Redis library since v4.1.2.

I think adding Redis connection check in health endpoint may not fully solve the problem. It may hide the problem, until no Centrifugo nodes left. I suppose you have just a Haproxy without Kubernetes? In this case Haproxy will remove failed node from the balancing, but the issue on Node will persist until someone restarts Centrifugo node. Is it right? Or it works differently? Probably there is some automation to restart failing node?

In Kubernetes liveness probe failure results into app restart – in that case Redis connection check could make more sense.

@Aohzan
Copy link
Author

Aohzan commented Jan 24, 2024

Yes, we will see to update to the latest version.

Yes, no k8s, what we want it's that the HAProxy will exclude the problematic Centrifugo from the pool as soon as it can't handle request properly due to Redis issue. So it let me the time to debug and restart Centrifugo.

@Aohzan
Copy link
Author

Aohzan commented Feb 3, 2024

Hello
We planned the upgrade, as we had the same issue on all nodes after a redis failover.
For the original request, do you have an opinion?

@FZambia
Copy link
Member

FZambia commented Feb 3, 2024

We planned the upgrade, as we had the same issue on all nodes after a redis failover.

Probably you can experiment with new version first regarding Sentinel failover to make sure it solves the problem. I mean without upgrading rest of the app. While upgrade is recommended anyway, this may help to iterate on issue faster.

For the original request, do you have an opinion?

I think optional Redis check could make sense in general, though prefer it being disabled by default – when dealing with many connections removing node from balancing may be not better that waiting for connection issues to go away. Looks like Centrifugo needs to issue some write request to Redis to make sure connection is working and writable.

UPD. Though not sure what to do with Redis Cluster case, where many Redis shards must be checked. In this perspective I'd invest to proper failover as I said before

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants