Redis connection loss resiliency #390
base: master
Conversation
Your Render PR Server URL is https://chproxy-pr-390.onrender.com. Follow its progress at https://dashboard.render.com/static/srv-cmeme9a1hbls738qmd0g.
I have a few initial comments. @mga-chka also mentioned that some metadata was stored in Redis that would be lost if we switch between Cache and no Cache fallbacks. We will need to iterate on this PR if that is still the case.
cache/redis_cache.go (Outdated)
func (f *redisCache) checkAlive() {
	for {
		select {
		case <-f.quitAliveCheck:
A more idiomatic way IMO would be to use a cancellable context, but it is a small nitpick. A lot of the code base was written before context became the norm anyway.
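For reference, a rough sketch of what the cancellable-context variant could look like (the ctx wiring from the caller is an assumption, not something in this PR):

func (f *redisCache) checkAlive(ctx context.Context) {
	for {
		select {
		case <-ctx.Done():
			// stop pinging once the caller cancels the context
			return
		default:
			f.alive = f.client.Ping(ctx).Err() == nil
			time.Sleep(pingInterval)
		}
	}
}

The caller would create the context with context.WithCancel(context.Background()) and call cancel() on shutdown instead of closing a quit channel.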
Alternatively, renaming the channel to done works for me as well.
I've never used a cancellable context yet, so I just renamed the channel for now.
But don't hesitate to tell me if it's needed 🙏
cache/redis_cache.go (Outdated)
			return
		default:
			f.alive = f.client.Ping(context.Background()).Err() == nil
			time.Sleep(pingInterval)
Using:
ticker := time.NewTicker(pingInterval)
defer ticker.Stop()
for {
	select {
	case <-done:
		return
	case <-ticker.C:
		// ping Redis and update the alive flag here
	}
}
will ensure that we ping at every ping interval, accounting for the time the Ping itself takes (e.g. if the ping interval is 5s and the ping takes 1s, the next tick fires 4s later).
Nice 👍 I switched to using a ticker instead.
}

func (f *redisCache) Alive() bool {
	return f.alive
While it most likely isn't an issue now (there is only one reader and one writer), there is a race condition on the alive boolean, and an atomic variable would be better suited.
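If we do want to make it race-free, a minimal sketch with sync/atomic could look like this (the setAlive helper and the surrounding struct layout are only illustrative; atomic.Bool requires Go 1.19+):

// import "sync/atomic"

type redisCache struct {
	// ... other fields from the real struct omitted
	alive atomic.Bool
}

func (f *redisCache) setAlive(ok bool) {
	f.alive.Store(ok)
}

func (f *redisCache) Alive() bool {
	return f.alive.Load()
}

The healthcheck loop would then call f.setAlive(...) instead of assigning to the plain boolean.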
This idea crossed my mind, but I thought it wasn't necessary for now. I can work on something if needed.
So, as discussed offline, this PR does have an impact on concurrent transactions. As far as I can see, in the case of a concurrent transaction the error would be returned and, in the serveFromCache method, we would answer with http.StatusInternalServerError, which means that transaction would fail. However, the PR adds a guard around serveFromCache based on the state of the Redis healthcheck, so this should only affect in-flight requests at the time of the healthcheck failure (which is probably acceptable).

It does mean we suddenly lose protection from the thundering-herd problem (https://www.chproxy.org/configuration/caching/). As a future improvement we could allow the cache to fall back to file/in-memory caching plus transaction maintenance, so that we can provide at least some level of protection even in case of Redis failure. We could also add a configuration option to let users decide between protection from the thundering herd and protection from Redis failure.

One potential area of improvement is to also use the Alive value to short-circuit certain operations of the async cache (such as AwaitForConcurrentTransaction, where there is no need to try to reach Redis), or even to use it to fall back to a simpler in-memory cache.
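To make that last point concrete, a hypothetical sketch of the short-circuit (proxyDirectly, rw and req are made-up names for illustration, not actual chproxy identifiers):

if !cache.Alive() {
	// Redis is unreachable: skip the cached path (and its
	// concurrent-transaction handling) and proxy the query directly,
	// instead of failing later inside serveFromCache
	proxyDirectly(rw, req)
	return
}
serveFromCache(rw, req)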
Description
Trying to solve this issue: #337
Commit by commit:
Remark: the method serveFromCache already falls back to proxying requests to ClickHouse (after some time...) in case of error, so this PR just adds a layer in front of this mechanism.
I was not sure what tests to add, so feel free to tell me.
Pull request type
Please check the type of change your PR introduces:
Checklist
Does this introduce a breaking change?
Further comments