Support checking a percentage of cache fetches against the database #492

Open
dylanahsmith opened this issue Mar 24, 2021 · 0 comments

@dylanahsmith
Contributor

Problem

It is hard to tell how often the cache is wrong under normal operation (e.g. because cache invalidations are lost to network issues), or to notice when application bugs (e.g. writes that don't trigger after_commit) or unknown Identity Cache bugs make it worse.

Proposal

Add support for checking a percentage of cache hits for correctness against the database. The results could be exposed with ActiveSupport::Notifications.instrument and used to track a correctness ratio over time, which would show when regressions are introduced. The ratio could also be split by cache index to notice bugs that affect only a subset of cache indexes.
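As a rough illustration of the sampling and per-index ratio described above, here is a minimal sketch in plain Ruby. `CacheCheckSampler` and all of its methods are hypothetical names made up for this example; nothing here is part of Identity Cache. In an application, `record` would likely be called from a subscriber to whatever event the check publishes.

```ruby
# Hypothetical sketch: none of these names exist in Identity Cache.
class CacheCheckSampler
  attr_reader :stats

  # percentage: share of cache hits to verify, from 0 to 100.
  # random: injectable so sampling can be made reproducible.
  def initialize(percentage:, random: Random.new)
    @percentage = percentage
    @random = random
    # Per-index counters; each new index starts with zero checks and mismatches.
    @stats = Hash.new { |hash, index| hash[index] = { checked: 0, mismatched: 0 } }
  end

  # True for roughly `percentage` percent of calls.
  def sample?
    @random.rand(100.0) < @percentage
  end

  # Record the outcome of one verified cache hit for the given cache index.
  def record(index, matched:)
    entry = @stats[index]
    entry[:checked] += 1
    entry[:mismatched] += 1 unless matched
  end

  # Fraction of verified hits that matched the database,
  # or nil if nothing was checked for this index.
  def correctness_ratio(index)
    entry = @stats.fetch(index) { return nil }
    (entry[:checked] - entry[:mismatched]).fdiv(entry[:checked])
  end
end
```

Keeping the counters keyed by cache index is what makes it possible to spot a bug that only affects one index, since a single global ratio would dilute it.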

The data loaded from the database can be serialized and compared to the serialized data fetched from the cache. If they differ, we can attempt a CAS set of the data loaded from the database to correct the cached value.
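The compare-then-CAS flow could look roughly like the sketch below. It uses a toy in-memory store with a version number as the CAS token; `InMemoryCasStore` and `check_cache_hit` are hypothetical names for illustration, not Identity Cache or memcached client APIs. `Marshal.dump` stands in for whatever serialization the cache really uses, and it assumes both sides build their hashes with the same key order.

```ruby
# Hypothetical stand-in for a cache backend that supports compare-and-swap.
class InMemoryCasStore
  Entry = Struct.new(:value, :version)

  def initialize
    @entries = {}
  end

  # Unconditional write; bumps the version so outstanding CAS tokens go stale.
  def set(key, value)
    version = (@entries[key]&.version || 0) + 1
    @entries[key] = Entry.new(value, version)
  end

  # Returns [value, version], or nil on a miss. The version is the CAS token.
  def get(key)
    entry = @entries[key]
    entry && [entry.value, entry.version]
  end

  # Writes only if the entry still has the version that was read.
  def cas(key, value, version)
    entry = @entries[key]
    return false unless entry && entry.version == version

    set(key, value)
    true
  end
end

# Compare the cached blob with freshly serialized database data.
# Returns :miss, :match, :corrected, or :conflict.
def check_cache_hit(store, key, database_row)
  cached_blob, version = store.get(key)
  return :miss if cached_blob.nil?

  fresh_blob = Marshal.dump(database_row)
  return :match if cached_blob == fresh_blob

  # A failed CAS means something else wrote the key since we read it.
  store.cas(key, fresh_blob, version) ? :corrected : :conflict
end
```

For example, after `store.set("post:1", Marshal.dump({ "id" => 1, "title" => "old" }))`, calling `check_cache_hit(store, "post:1", { "id" => 1, "title" => "new" })` would take the `:corrected` path, and a repeat call with the same row would then report `:match`.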

Cache invalidations aren't done atomically with the database write, so we should aim to reduce false positives. Detecting CAS set conflicts when correcting the cached value is one way to do this, but it is still possible for a recent database write to be loaded and for the cache invalidation to complete only after the cached value is "corrected". As such, we should provide the maximum updated_at timestamp across the cached rows as an estimate of the age of the database data, which can then be used to exclude recently written data when looking for incorrectness caused by missing cache invalidations. Note that these timestamps are affected by clock skew and reflect the time before the write rather than the commit time. An appropriate threshold for excluding recently written data should be at least the sum of the maximum expected clock skew, the maximum transaction duration, and the maximum expected time to invalidate the cache after the transaction commits. For simplicity, we might just want to conservatively use the hard timeout duration for web requests and jobs.
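The threshold arithmetic and the recency filter can be sketched as follows. Both method names are hypothetical, and the durations in the usage note are made-up values for illustration rather than recommendations. Treating rows without any updated_at value as not recent is an assumption of this sketch.

```ruby
# Sum of the three delays that can make a correct cache look stale, in seconds.
def recent_write_threshold(max_clock_skew:, max_transaction_duration:, max_invalidation_delay:)
  max_clock_skew + max_transaction_duration + max_invalidation_delay
end

# True when the newest updated_at among the cached rows falls within the
# threshold, meaning a mismatch may just be an invalidation still in flight.
# Assumption: with no timestamps available, the data is treated as not recent.
def recently_written?(updated_at_values, threshold:, now: Time.now)
  newest = updated_at_values.compact.max
  return false if newest.nil?

  (now - newest) < threshold
end
```

With an illustrative 1 second of clock skew, a 30 second maximum transaction duration, and 5 seconds to invalidate, the threshold is 36 seconds, so a mismatch on a row updated 10 seconds ago would be excluded while one on a row updated 100 seconds ago would be counted.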
