
Redis connection loss resiliency #390

Draft · wants to merge 7 commits into master

Conversation

@Vince-Chenal Vince-Chenal commented Jan 9, 2024

Description

Trying to solve this issue: #337

Commit by commit:

  1. Stop checking Redis connectivity at application startup
  2. Add a background routine that pings Redis and maintains its status
  3. Serve requests from the cache only when the cache is seen as alive
  4. Add a cache_alive metric to enable monitoring of cache connectivity (a gauge sketch follows this list)
  5. Update the rediscache wrong-instantiation test according to 1.
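
For step 4, a hedged sketch of what such a gauge could look like with the standard Prometheus Go client; the metric name, label, and registration details here are assumptions for illustration, not necessarily the PR's actual code:

package main

import "github.com/prometheus/client_golang/prometheus"

// cacheAlive is a hypothetical gauge: 1 when the background ping sees the
// cache backend as reachable, 0 otherwise.
var cacheAlive = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "cache_alive",
		Help: "Whether the cache backend is reachable (1) or not (0).",
	},
	[]string{"cache"},
)

func init() {
	prometheus.MustRegister(cacheAlive)
}

// reportAlive would be called from the ping loop after every health check.
func reportAlive(cacheName string, alive bool) {
	v := 0.0
	if alive {
		v = 1
	}
	cacheAlive.WithLabelValues(cacheName).Set(v)
}

func main() {
	reportAlive("redis", true)
}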

Remark: the serveFromCache method already falls back to proxying requests to ClickHouse (after some time...) in case of error, so this PR just adds a layer in front of that mechanism; a minimal sketch of this guard is shown below.
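
A minimal, self-contained sketch of the guard from step 3, assuming a hypothetical cache interface and handler names (handleQuery, serveFromCache, proxyToClickHouse are illustrative, not chproxy's real signatures):

package main

import "fmt"

// cache is a hypothetical interface exposing only the liveness flag
// maintained by the background ping routine.
type cache interface {
	Alive() bool
}

// handleQuery takes the cache path only when the background health check
// reports the cache as alive; otherwise it bypasses the cache entirely,
// avoiding a Redis timeout on every request.
func handleQuery(c cache, serveFromCache, proxyToClickHouse func()) {
	if c != nil && c.Alive() {
		// The existing cache path already falls back to proxying on error.
		serveFromCache()
		return
	}
	proxyToClickHouse()
}

type deadCache struct{}

func (deadCache) Alive() bool { return false }

func main() {
	handleQuery(deadCache{},
		func() { fmt.Println("serving from cache") },
		func() { fmt.Println("proxying straight to ClickHouse") },
	)
}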

I did not know what tests to add, so feel free to tell me.

Pull request type

Please check the type of change your PR introduces:

  • Bugfix
  • Feature
  • Code style update (formatting, renaming)
  • Refactoring (no functional changes, no api changes)
  • Build related changes
  • Documentation content changes
  • Other (please describe):

Checklist

  • Linter passes correctly
  • Add tests which fail without the change (if possible)
  • All tests passing
  • Extended the README / documentation, if necessary

Does this introduce a breaking change?

  • Yes
  • No

Further comments

@Blokje5 Blokje5 (Collaborator) left a comment


I have a few initial comments. @mga-chka also mentioned that some metadata was stored in Redis that would be lost if we switch between Cache and no Cache fallbacks. We will need to iterate on this PR if that is still the case.

func (f *redisCache) checkAlive() {
	for {
		select {
		case <-f.quitAliveCheck:
Collaborator:

A more idiomatic way IMO would be to use a cancellable context, but it is a small nitpick. A lot of the code base was written before context became the norm anyway.
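
For reference, a minimal sketch of the cancellable-context variant, assuming a trimmed-down redisCache stand-in and the go-redis v9 import path (both illustrative, not the PR's actual code):

package main

import (
	"context"
	"time"

	"github.com/redis/go-redis/v9"
)

const pingInterval = 5 * time.Second

// Trimmed-down stand-in for the PR's redisCache, for illustration only.
type redisCache struct {
	client redis.UniversalClient
	alive  bool
}

// checkAlive stops as soon as the caller cancels ctx, instead of reading
// from a dedicated quit/done channel.
func (f *redisCache) checkAlive(ctx context.Context) {
	for {
		select {
		case <-ctx.Done():
			return
		default:
			f.alive = f.client.Ping(ctx).Err() == nil
			time.Sleep(pingInterval)
		}
	}
}

func main() {
	c := &redisCache{client: redis.NewClient(&redis.Options{Addr: "localhost:6379"})}
	ctx, cancel := context.WithCancel(context.Background())
	go c.checkAlive(ctx)
	time.Sleep(2 * pingInterval)
	cancel() // stops the background loop on its next iteration
}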

Collaborator:

Alternatively, renaming the channel to done works for me as well.

Author:

I've never used a cancellable context yet, so I just renamed the channel for now.
But don't hesitate to tell me if needed 🙏

			return
		default:
			f.alive = f.client.Ping(context.Background()).Err() == nil
			time.Sleep(pingInterval)
Collaborator:

Using:

ticker := time.NewTicker(pingInterval)
defer ticker.Stop()

for {
	select {
	case <-done:
		return
	case <-ticker.C:
		// ping Redis and update the alive status here
	}
}

This will ensure that we ping once every ping interval, taking into account the time it takes to execute the Ping as well (e.g. if the ping interval is 5s and the ping takes 1s, the next tick will fire after 4s).

Author:

Nice 👍 I switched to using a ticker instead.

}

func (f *redisCache) Alive() bool {
	return f.alive
Collaborator:

While it most likely isn't an issue right now (there is only one reader and one writer), there is a race condition on the alive boolean, and an atomic variable would be better suited.
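
A minimal sketch of the atomic variant, using Go 1.19's sync/atomic Bool type; the trimmed-down redisCache is illustrative only:

package main

import (
	"fmt"
	"sync/atomic"
)

// Illustrative stand-in: the alive flag as an atomic to avoid the data race
// between the ping goroutine (writer) and request handlers (readers).
type redisCache struct {
	alive atomic.Bool
}

func (f *redisCache) setAlive(ok bool) { f.alive.Store(ok) }

func (f *redisCache) Alive() bool { return f.alive.Load() }

func main() {
	c := &redisCache{}
	c.setAlive(true)
	fmt.Println(c.Alive()) // true
}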

Author:

This idea crossed my mind, but I thought it wasn't necessary for now. I can work on something if needed.

@Blokje5 Blokje5 (Collaborator) commented Jan 15, 2024

So, as discussed offline, this PR does have an impact on concurrent transactions.

As far as I can see, the following could happen in the case of a concurrent transaction:
The TransactionRegistry is used to maintain the status of concurrent queries across multiple proxies.
Concurrent transactions are checked in AwaitForConcurrentTransaction, which will return TransactionStatus{State: transactionAbsent} plus an error in case of an error on redis.Get.

That means the transaction would fail: the error would be returned, and in the serveFromCache method we would respond with http.StatusInternalServerError.

However, the PR adds a guard around serveFromCache based on the state of the Redis healthcheck, so this should only affect requests that are in flight at the time of the healthcheck failure (which is probably acceptable).

It does mean we suddenly lose protection from the Thundering Herd problem (https://www.chproxy.org/configuration/caching/).

As a future improvement we could allow the cache to fall back to file/in-memory caching plus transaction maintenance, so we can still provide a certain level of protection even in case of Redis failure.

We could add a configuration option to let users choose between protection from the Thundering Herd and protection from Redis failure.

One potential area of improvement is to also use the Alive value to short-circuit certain operations of the async cache (such as AwaitForConcurrentTransaction, where there is no need to try to reach Redis) or even to use it as a trigger to fall back to a simpler in-memory cache.
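
A hypothetical sketch of that short circuit, with made-up names standing in for chproxy's transaction-registry plumbing (the real API and return types differ):

package cache

// Hypothetical stand-ins for the transaction registry types discussed above.
type TransactionState int

const transactionAbsent TransactionState = 0

type TransactionStatus struct{ State TransactionState }

type transactionRegistry interface {
	Status(key string) (TransactionStatus, error)
}

type aliveChecker interface {
	Alive() bool
}

// awaitForConcurrentTransaction consults Redis only when the health check
// says it is reachable; otherwise it reports the transaction as absent
// immediately instead of waiting for redis.Get to time out, trading
// thundering-herd protection for fast failure while Redis is down.
func awaitForConcurrentTransaction(c aliveChecker, r transactionRegistry, key string) (TransactionStatus, error) {
	if !c.Alive() {
		return TransactionStatus{State: transactionAbsent}, nil
	}
	return r.Status(key)
}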
