
Redis connection loss resiliency #390

Draft · wants to merge 7 commits into master

Conversation

@Vince-Chenal Vince-Chenal commented Jan 9, 2024

Description

Trying to solve this issue: #337

Commit by commit:

  1. Stop checking Redis connectivity at application startup
  2. Add a background routine that pings Redis and maintains its status
  3. Serve requests from the cache only when the cache is seen as alive
  4. Add a cache_alive metric to enable monitoring of cache connectivity (a gauge sketch follows this list)
  5. Update the rediscache wrong-instantiation test according to 1.
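
For step 4, a hedged sketch of what such a gauge could look like with the standard Prometheus Go client; the metric name, label, and registration details here are assumptions for illustration, not necessarily the PR's actual code:

package main

import "github.com/prometheus/client_golang/prometheus"

// cacheAlive is a hypothetical gauge: 1 when the background ping sees the
// cache backend as reachable, 0 otherwise.
var cacheAlive = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "cache_alive",
		Help: "Whether the cache backend is reachable (1) or not (0).",
	},
	[]string{"cache"},
)

func init() {
	prometheus.MustRegister(cacheAlive)
}

// reportAlive would be called from the ping loop after every health check.
func reportAlive(cacheName string, alive bool) {
	v := 0.0
	if alive {
		v = 1
	}
	cacheAlive.WithLabelValues(cacheName).Set(v)
}

func main() {
	reportAlive("redis", true)
}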

Remark: the serveFromCache method already falls back to proxying requests to ClickHouse (after some time...) in case of error, so this PR just adds a layer in front of that mechanism; a minimal sketch of this guard is shown below.
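
A minimal, self-contained sketch of the guard from step 3, assuming a hypothetical cache interface and handler names (handleQuery, serveFromCache, proxyToClickHouse are illustrative, not chproxy's real signatures):

package main

import "fmt"

// cache is a hypothetical interface exposing only the liveness flag
// maintained by the background ping routine.
type cache interface {
	Alive() bool
}

// handleQuery takes the cache path only when the background health check
// reports the cache as alive; otherwise it bypasses the cache entirely,
// avoiding a Redis timeout on every request.
func handleQuery(c cache, serveFromCache, proxyToClickHouse func()) {
	if c != nil && c.Alive() {
		// The existing cache path already falls back to proxying on error.
		serveFromCache()
		return
	}
	proxyToClickHouse()
}

type deadCache struct{}

func (deadCache) Alive() bool { return false }

func main() {
	handleQuery(deadCache{},
		func() { fmt.Println("serving from cache") },
		func() { fmt.Println("proxying straight to ClickHouse") },
	)
}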

I did not know what tests to add, so feel free to tell me.

Pull request type

Please check the type of change your PR introduces:

  • Bugfix
  • Feature
  • Code style update (formatting, renaming)
  • Refactoring (no functional changes, no api changes)
  • Build related changes
  • Documentation content changes
  • Other (please describe):

Checklist

  • Linter passes correctly
  • Add tests which fail without the change (if possible)
  • All tests passing
  • Extended the README / documentation, if necessary

Does this introduce a breaking change?

  • Yes
  • No

Further comments

@Blokje5 Blokje5 (Collaborator) left a comment


I have a few initial comments. @mga-chka also mentioned that some metadata was stored in Redis that would be lost if we switch between Cache and no Cache fallbacks. We will need to iterate on this PR if that is still the case.

func (f *redisCache) checkAlive() {
	for {
		select {
		case <-f.quitAliveCheck:
Collaborator:

A more idiomatic way IMO would be to use a cancellable context, but it is a small nitpick. A lot of the code base was written before context became the norm anyway.
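
For reference, a minimal sketch of the cancellable-context variant, assuming a trimmed-down redisCache stand-in and the go-redis v9 import path (both illustrative, not the PR's actual code):

package main

import (
	"context"
	"time"

	"github.com/redis/go-redis/v9"
)

const pingInterval = 5 * time.Second

// Trimmed-down stand-in for the PR's redisCache, for illustration only.
type redisCache struct {
	client redis.UniversalClient
	alive  bool
}

// checkAlive stops as soon as the caller cancels ctx, instead of reading
// from a dedicated quit/done channel.
func (f *redisCache) checkAlive(ctx context.Context) {
	for {
		select {
		case <-ctx.Done():
			return
		default:
			f.alive = f.client.Ping(ctx).Err() == nil
			time.Sleep(pingInterval)
		}
	}
}

func main() {
	c := &redisCache{client: redis.NewClient(&redis.Options{Addr: "localhost:6379"})}
	ctx, cancel := context.WithCancel(context.Background())
	go c.checkAlive(ctx)
	time.Sleep(2 * pingInterval)
	cancel() // stops the background loop on its next iteration
}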

Collaborator:

Alternatively, renaming the channel to done works for me as well.

Author:

I've never used a cancellable context yet, so I just renamed the channel for now.
But don't hesitate to tell me if needed 🙏

			return
		default:
			f.alive = f.client.Ping(context.Background()).Err() == nil
			time.Sleep(pingInterval)
Collaborator:

Using:

ticker := time.NewTicker(pingInterval)
defer ticker.Stop()

for {
	select {
	case <-done:
		return
	case <-ticker.C:
		// ping Redis and update the alive status here
	}
}

This will ensure that we ping once every ping interval, taking into account the time it takes to execute the Ping as well (e.g. if the ping interval is 5s and the ping takes 1s, the next tick will fire after 4s).

Author:

Nice 👍 I switched to using a ticker instead.

}

func (f *redisCache) Alive() bool {
	return f.alive
Collaborator:

While it most likely isn't an issue right now (there is only one reader and one writer), there is a race condition on the alive boolean, and an atomic variable would be better suited.
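
A minimal sketch of the atomic variant, using Go 1.19's sync/atomic Bool type; the trimmed-down redisCache is illustrative only:

package main

import (
	"fmt"
	"sync/atomic"
)

// Illustrative stand-in: the alive flag as an atomic to avoid the data race
// between the ping goroutine (writer) and request handlers (readers).
type redisCache struct {
	alive atomic.Bool
}

func (f *redisCache) setAlive(ok bool) { f.alive.Store(ok) }

func (f *redisCache) Alive() bool { return f.alive.Load() }

func main() {
	c := &redisCache{}
	c.setAlive(true)
	fmt.Println(c.Alive()) // true
}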

Author:

This idea crossed my mind, but I thought it wasn't necessary for now. I can work on something if needed.

@Blokje5 Blokje5 (Collaborator) commented Jan 15, 2024

So, as discussed offline, this PR does have an impact on concurrent transactions.

As far as I can see, the following could happen in the case of a concurrent transaction:
The TransactionRegistry is used to maintain the status of concurrent queries across multiple proxies.
Concurrent transactions are checked in AwaitForConcurrentTransaction, which will return TransactionStatus{State: transactionAbsent} plus an error in case of an error on redis.Get.

That means the transaction would fail: the error would be returned, and in the serveFromCache method we would respond with http.StatusInternalServerError.

However, the PR adds a guard around serveFromCache based on the state of the Redis healthcheck, so this should only affect requests that are in flight at the time of the healthcheck failure (which is probably acceptable).

It does mean we suddenly lose protection from the Thundering Herd problem (https://www.chproxy.org/configuration/caching/).

As a future improvement we could allow the cache to fall back to file/in-memory caching plus transaction maintenance, so we can still provide a certain level of protection even in case of Redis failure.

We could add a configuration option to let users choose between protection from the Thundering Herd and protection from Redis failure.

One potential area of improvement is to also use the Alive value to short-circuit certain operations of the async cache (such as AwaitForConcurrentTransaction, where there is no need to try to reach Redis) or even to use it as a trigger to fall back to a simpler in-memory cache.
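
A hypothetical sketch of that short circuit, with made-up names standing in for chproxy's transaction-registry plumbing (the real API and return types differ):

package cache

// Hypothetical stand-ins for the transaction registry types discussed above.
type TransactionState int

const transactionAbsent TransactionState = 0

type TransactionStatus struct{ State TransactionState }

type transactionRegistry interface {
	Status(key string) (TransactionStatus, error)
}

type aliveChecker interface {
	Alive() bool
}

// awaitForConcurrentTransaction consults Redis only when the health check
// says it is reachable; otherwise it reports the transaction as absent
// immediately instead of waiting for redis.Get to time out, trading
// thundering-herd protection for fast failure while Redis is down.
func awaitForConcurrentTransaction(c aliveChecker, r transactionRegistry, key string) (TransactionStatus, error) {
	if !c.Alive() {
		return TransactionStatus{State: transactionAbsent}, nil
	}
	return r.Status(key)
}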
