
Cache invalidations can be missed on rollout of schema changes #535

Open
dylanahsmith opened this issue Apr 5, 2023 · 1 comment
@dylanahsmith
Contributor

The problem is mentioned in the README caveats:

Expected sources of lost cache invalidations include:
...
Rollout of cache namespace changes (e.g. from upgrading IdentityCache, adding columns, cached associations or from application changes to IdentityCache.cache_namespace) can result in cache fills to the new namespace that aren't invalidated by cache invalidations from a process still using the old namespace

This is particularly a problem for slower rollouts of code, such as with canary deploys, which leave a larger window of time for database updates to happen while only one of the cache keys is being invalidated.

It is also hard to argue that developers should expect this source of lost cache invalidations, since the developer's change may just be to the model's ignored_columns attribute for a column that isn't otherwise being used. In other words, this is action at a distance, which can confuse both developers deploying the change and users who see the stale data.

As such, I think this is a problem worth fixing. However, I have had a hard time thinking of a way to do it, since we don't want to significantly compromise performance for this edge case, and we don't want to expect developers making the change to be aware of the issue and make other changes to enable something like double cache invalidation (for both keys with the old and new denormalized_schema_hash in their cache key prefix).

Proposed Solution

I think the solution can be focused on the primary cache index, since that is what is affected by column changes. In the past, we have often bumped IdentityCache::CACHE_VERSION when changes were made to the format of the primary cache index alone, but I think we could instead introduce a cache version prefix into the denormalized_schema_hash so that we can focus on changes to that schema hash.

Since we want a solution that doesn't require manual work from the developer changing the schema, and since the application can't know the schema hash used by other versions of itself, we can move the schema hash out of the key and into the value. The primary cache index already encodes records as a Hash for serialization, so additional keys can be added for versioning. The schema hash can then be checked after loading the value, so the value is discarded on a mismatch.
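To make this concrete, here is a minimal sketch of storing the schema hash in the serialized value and checking it on load. The names encode_record, decode_record, and the schema fingerprint are hypothetical, not IdentityCache's actual API:

```ruby
require "digest"

# Hypothetical schema fingerprint for the running code's view of the table.
SCHEMA_HASH = Digest::SHA256.hexdigest("id:integer,name:string")[0, 12]

# Store the schema hash alongside the record's attributes, instead of
# embedding it in the cache key.
def encode_record(attributes)
  { "schema_hash" => SCHEMA_HASH, "attributes" => attributes }
end

# Return the attributes only when the stored schema hash matches the one
# expected by the running code; a mismatch means the value was filled by
# a different application version.
def decode_record(value, expected_schema_hash: SCHEMA_HASH)
  return nil unless value.is_a?(Hash)
  return nil unless value["schema_hash"] == expected_schema_hash
  value["attributes"]
end
```

The check happens after the fetch, so the key itself stays stable across schema changes.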

When we detect a schema hash mismatch after loading from this new key that doesn't include the schema hash, we don't want to simply replace the value with one for the correct schema version, since that could lead to contention during a deploy, with two versions of the application repeatedly replacing each other's cache values. Instead, this can be solved by falling back to a schema-hash-versioned key (e.g. we could reuse the same keys that we currently use), so the value can be stored for both schema versions. I'll refer to these as the main key (without a schema hash in the key) and the fallback key (which includes the schema hash in the key).
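The two-key read path could look roughly like this. This is a sketch against a Hash-like store with made-up key formats, not IdentityCache's real key layout:

```ruby
# Hypothetical read path: try the main key first, and only fall back to the
# schema-hash-versioned key when the main key's value was written by a
# different schema version. `cache` is any Hash-like store.
def read_record(cache, model, id, schema_hash)
  main = cache["blob:#{model}:#{id}"]
  return main["attributes"] if main && main["schema_hash"] == schema_hash

  # Schema mismatch (or miss) on the main key: use the fallback key, whose
  # name already encodes this code version's schema hash.
  fallback = cache["blob:#{model}:#{schema_hash}:#{id}"]
  fallback && fallback["attributes"]
end
```

During normal operation the fallback lookup never runs, which is why steady-state performance stays comparable to today's single-key scheme.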

We won't be replacing the fallback key on cache invalidation; in fact, we won't even be able to build that key for another version of the app. Instead, we can include a new SecureRandom.uuid data_version in the cache invalidation value, which will then be copied into the corresponding value when doing a cache fill of either the main key or the fallback key. That way, a cache invalidation can still be detected, even after an invalidation and subsequent cache fill of the main key under a different schema hash, by checking the data_version of the main key against the fallback key and treating a mismatch as if the fallback key's value had been invalidated.
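The data_version handshake described above can be sketched as follows. The function names and value shapes are illustrative assumptions, not the library's API:

```ruby
require "securerandom"

# On invalidation, write a fresh UUID into the main key's invalidation value.
def invalidate(cache, main_key)
  cache[main_key] = { "data_version" => SecureRandom.uuid }
end

# Every cache fill (of the main key or a fallback key) copies the current
# data_version into the stored value.
def fill(cache, key, attributes, data_version)
  cache[key] = { "attributes" => attributes, "data_version" => data_version }
end

# A fallback value is only trusted when its data_version matches the main
# key's, so a fill done under a different schema version can't mask an
# invalidation that happened in between.
def fallback_fresh?(cache, main_key, fallback_key)
  main = cache[main_key]
  fallback = cache[fallback_key]
  !!(main && fallback && main["data_version"] == fallback["data_version"])
end
```

A mismatch is treated exactly like an invalidated fallback key: the reader refills it from the database.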

The performance of the proposed solution during normal operation will be comparable to the current performance, since it would just be using the main key when there aren't any schema hash mismatches. A deploy of a schema change would add latency to load the fallback key after a schema hash mismatch, but only when loading caches of records for the model whose schema changed, and it wouldn't introduce additional database load. After the deploy completes, it will converge back to using just the main key as the cache entries for the old schema are invalidated or expire.

An additional optimization can improve performance during schema mismatches and reduce the time to convergence: after filling the fallback key as a result of a schema mismatch, copy the cache value in the main key into its corresponding fallback key, then replace the main key with a multi-schema value marker containing the associated data_version and a timestamp for the current time. The marker value would be much smaller, reducing unnecessary data transfer when loading the value from the cache. A configurable duration can determine when the multi-schema value marker should be replaced with the fallback key value for the currently running code, so it can converge faster than the general cache value expiration duration.
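The marker and its expiry decision could be as small as this. The value shape and the TTL constant are assumptions for the sketch, with the duration imagined as configurable:

```ruby
# Assumed configurable duration (seconds) before the multi-schema marker is
# replaced with the current code's fallback value.
MULTI_SCHEMA_TTL = 300

# A small placeholder left in the main key while multiple schema versions
# are live; it carries only the data_version and when it was written.
def multi_schema_marker(data_version, now = Time.now.to_i)
  { "multi_schema" => true, "data_version" => data_version, "marked_at" => now }
end

# Once the marker has been in place longer than the configured duration,
# the running code can promote its fallback value back into the main key.
def marker_expired?(marker, now = Time.now.to_i)
  now - marker["marked_at"] >= MULTI_SCHEMA_TTL
end
```

Because the marker is tiny compared to a serialized record, readers during the mixed-version window avoid transferring a full value they would discard anyway.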

Schwad commented Apr 26, 2023

@dylanahsmith - Hi! There was a request for a review of this issue.

First, can I echo back my understanding for confirmation here?

  1. I understand the issue you've raised relates to lost cache invalidations during slow rollouts of schema changes, causing stale data and confusion.
  2. Your proposed solution focuses on the primary cache index and introduces a cache version prefix. By moving the schema hash into the value, checking it after loading, and using fallback keys in case of mismatch, you aim to effectively handle cache invalidations without significantly impacting performance during normal operations.
  3. Additionally, the suggested optimization could further reduce latency and convergence time during schema mismatches.

Is that roughly correct?
