Cache Reconnection #1948
mattwiller
started this conversation in
Ideas
Replies: 2 comments
Thanks @mattwiller, this is a great doc. As discussed offline, we may need to consider separating BullMQ out to a separate Redis instance to avoid interfering with ongoing jobs. Overall, this seems like a win for correctness and durability.
After some more offline discussion and research, I think a rough sequencing of work might look like this:
Additionally, a few considerations have been identified:
Goals
Present state
Medplum currently uses a cache-aside caching strategy:
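For context, the cache-aside pattern can be sketched as follows. This is a minimal illustration, not Medplum's actual code: the `readResource`/`writeResource` names are hypothetical, and in-memory `Map`s stand in for Redis and the database.

```typescript
// Illustrative cache-aside sketch; Maps stand in for Redis and Postgres.
type Resource = { id: string; data: string };

const cache = new Map<string, Resource>(); // stand-in for Redis
const database = new Map<string, Resource>([
  ['patient-1', { id: 'patient-1', data: 'Alice' }],
]); // stand-in for the database

function readResource(id: string): Resource | undefined {
  // 1. Try the cache first.
  const cached = cache.get(id);
  if (cached) {
    return cached;
  }
  // 2. On a miss, read from the database.
  const fromDb = database.get(id);
  if (fromDb) {
    // 3. Populate the cache so subsequent reads hit.
    cache.set(id, fromDb);
  }
  return fromDb;
}

function writeResource(resource: Resource): void {
  // Writes go to the database, and the cache entry is refreshed.
  database.set(resource.id, resource);
  cache.set(resource.id, resource);
}
```

The key property for this discussion is that every cache miss falls through to the database, which is why cache unavailability translates directly into database load.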
Cache Reconnection Scenario
Proposed solution
If the cache is unavailable, falling back to reading and writing directly from the database will maintain application availability, but it also has the potential to increase database load to unsustainable levels. To mitigate this risk, service-wide rate limits could be enforced for all read and write operations. In the event that the cache becomes unavailable, we could also consider dynamically reducing those limits.
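A dynamically adjustable limit could be sketched like this. The `AdaptiveRateLimiter` name and the simple in-flight counting scheme are assumptions for illustration, not Medplum's implementation:

```typescript
// Sketch: a service-wide limiter whose ceiling drops while the cache is
// unavailable, protecting the database from unbounded fallback traffic.
class AdaptiveRateLimiter {
  private inFlight = 0;

  constructor(
    private normalLimit: number,
    private degradedLimit: number,
    private cacheAvailable: boolean = true
  ) {}

  // Called from cache connect/disconnect handlers.
  setCacheAvailable(available: boolean): void {
    this.cacheAvailable = available;
  }

  private get limit(): number {
    return this.cacheAvailable ? this.normalLimit : this.degradedLimit;
  }

  // Returns false when over the current limit; the caller would
  // reject the request (e.g. HTTP 429) instead of hitting the database.
  tryAcquire(): boolean {
    if (this.inFlight >= this.limit) {
      return false;
    }
    this.inFlight++;
    return true;
  }

  release(): void {
    this.inFlight--;
  }
}
```

When the cache client reports a disconnect, `setCacheAvailable(false)` shrinks the ceiling immediately; reconnecting restores the normal limit.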
When reconnecting to the cache after a period of unavailability, we cannot assume that the cache is in any particular state: it may have crashed completely and now be empty, or it could have been on the other side of a network partition and be full of potentially stale data. The simplest solution in this case would be to issue a `FLUSHALL SYNC` command to Redis after reconnecting, to ensure that we start from an empty cache. This should prevent the application from reading stale data, with the caveat that multiple server instances will reconnect over a period of time, repeatedly clearing the cache and resulting in degraded performance until all instances are reconnected.