[DISCUSS] CHIP-1 Online IR and GetRequest Caching #630

Open · caiocamatta-stripe opened this issue Dec 4, 2023 · 6 comments

caiocamatta-stripe (Collaborator) commented Dec 4, 2023

Discuss CHIP-1 Online IR and GetRequest Caching.

Feedback on any aspect of this CHIP is welcome. Let me know if there are any questions!

cristianfr (Contributor) commented Dec 6, 2023

I'm very surprised by the BatchIr -> FinalBatchIr improvements with caching. Very impressive.
For step 2:
Adding a diagram would help a bit, since this caching can interact with KVStore implementations' own caching. It's fair for people to want to know whether a value was recalled from the Chronon fetcher cache or from the KV store's cache.

Right now we rely quite a bit on streaming and batch data overlapping. In our implementation, streaming rows are stored with a significant TTL, which lets us not worry too much about a delay in the batch upload (or refresh, in this case). For tighter integrations, it's probably worth investigating or proposing solutions that refresh the metadata as part of upload orchestration (for example, providing a runnable mode to update the metadata for the GroupBy).

For step 3:

> The Flink side marks tiles as complete.

Is this the default, or does it require additional setup?

Overall comments:

I'm not sure if I missed this (and this is why I think a diagram would be helpful), but we're talking about both caching for CPU performance and caching for network latency, and for both batch and streaming data. In that sense I don't think there's a sequential nature to the steps; they seem to work fairly independently of whether the bottleneck is CPU or network, and there's an option to trade memory for either. From what I've seen, memory tends to be quite precious, so it's hard to say how much configuration is too much for this (should only certain GroupBys be cached? all or nothing?)

Compressing tiles sounds good, but I'm not sure the added benefits will be worth the complexity, mainly because we need a certain granularity for features, say with a 1-hour window. Any place we lose granularity could create unforeseen issues. For example, you may need to evict from the cache if the feature has a 1-hour window and you merged too many 5-minute blocks.
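
To make that concrete, here's a rough sketch using a made-up tile model (not Chronon's actual classes):

```scala
// Made-up tile model purely for illustration; not Chronon's actual classes.
case class Tile(startMs: Long, endMs: Long)

// A cached (merged) tile is only reusable for a window query if it falls
// entirely inside the window's [start, end) range.
def reusable(tile: Tile, windowStartMs: Long, windowEndMs: Long): Boolean =
  tile.startMs >= windowStartMs && tile.endMs <= windowEndMs

// Twelve 5-minute tiles merged into a single 1-hour tile:
val merged = Tile(startMs = 0L, endMs = 60L * 60 * 1000)

// Once the 1-hour window slides forward by 5 minutes, it no longer fully
// contains the merged tile, so the cached entry has to be evicted (or split
// back into the original 5-minute tiles).
val stillUsable = reusable(merged, windowStartMs = 5L * 60 * 1000, windowEndMs = 65L * 60 * 1000) // false
```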

Edit: One pro of adding the caching in the KV store is that sharding and colocation of the cache can be controlled better by the KV store than by the fetcher.

caiocamatta-stripe (Collaborator, Author) commented:

Thanks for taking a look, @cristianfr.

Good idea to add diagrams. I added some in the new version. Let me know if those are sufficient or if there are any other diagrams you'd like me to add (I'd be happy to do so!)

I actually tested BatchIr -> FinalBatchIr caching using one of our production load tests, and we saw a 35% improvement in p99 serving latency on our first attempt, which is crazy. I'll add details on that later.
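
For anyone following along, the basic shape of that optimization is memoizing the BatchIr -> FinalBatchIr conversion per entity key and batch dataset. A minimal sketch of the idea, using made-up types and a Caffeine cache rather than our actual implementation:

```scala
import com.github.benmanes.caffeine.cache.{Cache, Caffeine}

// Illustrative stand-ins for Chronon's real IR types and finalization logic.
case class BatchIr(collapsedIr: Array[Any])
case class FinalBatchIr(collapsedIr: Array[Any])
def finalizeBatchIr(ir: BatchIr): FinalBatchIr = FinalBatchIr(ir.collapsedIr) // placeholder

// One entry per (GroupBy, entity key, batch-data end time). Including the batch end
// timestamp in the key means a new batch upload naturally produces new cache keys.
case class BatchIrCacheKey(groupByName: String, keyBytes: Seq[Byte], batchEndTsMillis: Long)

val finalBatchIrCache: Cache[BatchIrCacheKey, FinalBatchIr] =
  Caffeine.newBuilder()
    .maximumSize(10000) // bounded; eviction keeps the hottest keys
    .build[BatchIrCacheKey, FinalBatchIr]()

// Only pay the decode + finalize cost on a cache miss.
// (Not atomic across threads, but fine for a sketch.)
def getFinalBatchIr(key: BatchIrCacheKey, decodeBatchIr: () => BatchIr): FinalBatchIr = {
  val cached = finalBatchIrCache.getIfPresent(key)
  if (cached != null) cached
  else {
    val finalized = finalizeBatchIr(decodeBatchIr())
    finalBatchIrCache.put(key, finalized)
    finalized
  }
}
```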

> Right now we rely quite a bit on streaming and batch data overlapping. In our implementation, streaming rows are stored with a significant TTL, which lets us not worry too much about a delay in the batch upload

Similar at Stripe. We keep a few days of streaming data stored in case the batch data takes long to land.

> For tighter integrations, it's probably worth investigating or proposing solutions that refresh the metadata as part of upload orchestration

Agreed. This is one of the trickiest parts of this CHIP in my opinion, so I'll follow up on it once we start developing it internally.

> The Flink side marks tiles as complete. Is this the default, or does it require additional setup?

Right now, this is something we can turn on and off. The implication of turning it on is that Flink sends a burst of writes to the KV store at the end of each tile's time range. For example, if you have 1-hour tiles, there will be a spike in traffic at the end of each hour as Flink marks those tiles as complete in the KV store. We're not too worried about this spike in traffic -- I can go into detail if you're curious.
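
To make the completeness idea concrete, here's a hypothetical sketch of what a tile write could carry; the field and type names are illustrative, not our actual Flink job's:

```scala
// Hypothetical shape of a tile write; illustrative names only.
case class TileUpdate(
    groupByName: String,
    keyBytes: Seq[Byte],
    tileStartMs: Long,
    tileSizeMs: Long,
    irBytes: Array[Byte],
    isComplete: Boolean // true only for the final write emitted when the tile's window closes
)

// While events for a tile are still arriving, the job emits incremental writes with
// isComplete = false. Once the watermark passes tileStartMs + tileSizeMs, the window
// fires one last time and the job emits the final write with isComplete = true. Because
// every key's tile for a given interval closes at the same boundary, these final writes
// reach the KV store as a burst at the end of each tile -- the traffic spike mentioned above.
def markComplete(lastUpdate: TileUpdate): TileUpdate = lastUpdate.copy(isComplete = true)
```

The reason completeness matters for caching is that a complete tile won't change anymore, so it's safe to keep in the fetcher cache; tiles that are still open keep being updated as events arrive.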

> Should only certain GroupBys be cached? All or nothing?

From my chats with other folks at Stripe, we think only certain GroupBys should be cached. Some types of entity keys aren't very skewed, so they aren't worth caching; they'd just take up memory and cause GC pressure.

> Compressing tiles sounds good, but I'm not sure the added benefits will be worth the complexity

I'm actually also realizing that merging the tiles in cache will be tricky and maybe not worth it.

Another project we are working on is tile layering (storing tiles of multiple sizes in the KV store and fetching bigger ones when possible to reduce fanout and decoding work). With tile layering, merging tiles in the cache would either be impractical or too complex. So, we may leave tile compression as a potential future optimization.

better365 (Contributor) commented:

@caiocamatta-stripe I can't open the links in this doc. Can you please double check and update them? Thanks

caiocamatta-stripe (Collaborator, Author) commented:

@better365 sorry about that - they should be fixed now: CHIP-1 Online IR and GetRequest Caching.

caiocamatta-stripe (Collaborator, Author) commented:

Quick update: we tested Step 2 (i.e. not making batch GetRequests if the data is cached) in production and saw roughly a 39% decrease in p99 feature serving latency. I'll add more details when I get a chance.
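
For context, Step 2 boils down to checking the cache before issuing the batch GetRequests and only sending the misses to the KV store. A rough sketch with made-up types (not our actual fetcher code):

```scala
import com.github.benmanes.caffeine.cache.{Cache, Caffeine}

// Rough sketch of the Step 2 idea only -- names and types are illustrative,
// not Chronon's actual fetcher internals.
case class GetRequest(dataset: String, keyBytes: Seq[Byte])
case class CachedBatchResponse(batchEndTsMillis: Long, finalBatchIr: AnyRef)

val batchIrCache: Cache[GetRequest, CachedBatchResponse] =
  Caffeine.newBuilder().maximumSize(10000).build[GetRequest, CachedBatchResponse]()

// Partition incoming batch requests into cache hits (no KV store call needed at all)
// and misses (sent to the KV store as usual; their responses get cached on the way back).
def splitByCache(requests: Seq[GetRequest]): (Map[GetRequest, CachedBatchResponse], Seq[GetRequest]) = {
  val hits = requests
    .flatMap(r => Option(batchIrCache.getIfPresent(r)).map(v => r -> v))
    .toMap
  val misses = requests.filterNot(hits.contains)
  (hits, misses)
}
```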

caiocamatta-stripe (Collaborator, Author) commented:

We finished implementing the batch side of IR caching at Stripe.

Results

We saw good latency improvements for Joins that have GroupBys with hot/skewed keys. In our latest load tests at very high RPS, caching decreased p99 latency by 43%. CPU utilization also decreased by about 30%.

We added metrics on hit rate per GroupBy and learned that you don't need a very high hit rate to see latency improvements. Enabling caching for a GroupBy with just a 1.5% hit rate was enough to see a very small but measurable impact on latency.

I'll put up a PR over the coming weeks.

We're no longer rejecting "Rejected Alternative 4" (shared cache for all GroupBys)

We learned that having a shared cache for all GroupBys is better (previously Rejected Alternative #4). So instead of one cache per GroupBy, we're going with a shared cache for all GroupBys while still allowing users to enable/disable caching on a GroupBy-by-GroupBy basis. We also learned that enabling caching for features that don't benefit from it (i.e. have a very tiny hit rate) doesn't cause any measurable performance degradation, so we'll likely enable caching by default on all GroupBys.

Here's why we prefer a shared cache:

Imagine we have a 10,000-element cache, and three GroupBys being served: A (heavy skew), B (medium skew), C (little skew).

In the one-cache-per-GroupBy scenario, to decide how much cache to allocate to each, we would likely run some SQL queries to figure out the expected degree of skew, and what the theoretical hit rate would be if we enabled caching for the GroupBy. With caches of custom size, we may find the optimal configuration is to give A 9,000 elements, B 1,000, and disable caching for C.

In the shared-cache scenario, Caffeine's eviction policy does the work of figuring out the best allocation for us, ultimately landing on something like that same 9,000 + 1,000 split.

The benefits of a shared cache are that a) we don’t have to spend a bunch of time tuning what percentage of the cache to allocate to each GroupBy, and b) the allocation is dynamic and can respond to changes in traffic in seconds.
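
To illustrate the direction (assumed names only; the eventual PR may look different), the shared cache is essentially one bounded Caffeine cache keyed by GroupBy plus entity key, with a per-GroupBy flag gating participation:

```scala
import com.github.benmanes.caffeine.cache.{Cache, Caffeine}

// Sketch of the shared-cache direction described above; names are assumptions,
// not the actual implementation.
case class IrCacheKey(groupByName: String, keyBytes: Seq[Byte], batchEndTsMillis: Long)

val sharedIrCache: Cache[IrCacheKey, AnyRef] =
  Caffeine.newBuilder()
    .maximumSize(10000)  // single global budget; eviction decides the per-GroupBy split
    .recordStats()       // enables cache-wide hit rate / eviction metrics
    .build[IrCacheKey, AnyRef]()

// Hypothetical per-GroupBy toggle; in practice this would come from GroupBy config/metadata.
def cachingEnabled(groupByName: String): Boolean = true

def lookup(key: IrCacheKey): Option[AnyRef] =
  if (cachingEnabled(key.groupByName)) Option(sharedIrCache.getIfPresent(key)) else None

def store(key: IrCacheKey, value: AnyRef): Unit =
  if (cachingEnabled(key.groupByName)) sharedIrCache.put(key, value)

// Caffeine only exposes aggregate stats, so per-GroupBy hit rates would need separate
// counters incremented in lookup(); cache-wide stats come for free:
val cacheWideHitRate = sharedIrCache.stats().hitRate()
```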
