Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

transient in memory cache #4922

Draft
wants to merge 10 commits into
base: dev
Choose a base branch
from
Draft

transient in memory cache #4922

wants to merge 10 commits into from

Conversation

Geal
Copy link
Contributor

@Geal Geal commented Apr 5, 2024

This adds another level of in memory cache, to mitigate issues with unique or infrequent queries pushing frequently used queries out of the in memory cache.
If a query has only been seen once, then it is stored inthe transient cache (short term, small number of entries, LRU).
If that query is requested again, and is still present in the transient cache, then it is added to the long term, larger in memory cache, to the redis cache as well, and removed from the transient cache. If that query is requested again but not present in the transient cache, then we test the larger in memory cache and redis

Description here

Fixes #issue_number


Checklist

Complete the checklist (and note appropriate exceptions) before the PR is marked ready-for-review.

  • Changes are compatible1
  • Documentation2 completed
  • Performance impact assessed and acceptable
  • Tests added and passing3
    • Unit Tests
    • Integration Tests
    • Manual Tests

Exceptions

Note any exceptions here

Notes

Footnotes

  1. It may be appropriate to bring upcoming changes to the attention of other (impacted) groups. Please endeavour to do this before seeking PR approval. The mechanism for doing this will vary considerably, so use your judgement as to how and when to do this.

  2. Configuration is an important part of many changes. Where applicable please try to document configuration examples.

  3. Tick whichever testing boxes are applicable. If you are adding Manual Tests, please document the manual testing (extensively) in the Exceptions.

This adds another level of in memory cache, to mitigate issues with
unique or infrequent queries pushing frequently used queries out of the
in memory cache.
If a query has only been seen once, then it is stored inthe transient
cache (short term, small number of entries, LRU).
If that query is requested again, and is still present in the transient
cache, then it is added to the long term, larger in memory cache, to the
redis cache as well, and removed from the transient cache.
If that query is requested again but not present in the transient cache,
then we test the larger in memory cache and redis
Copy link
Contributor

github-actions bot commented Apr 5, 2024

@Geal, please consider creating a changeset entry in /.changesets/. These instructions describe the process and tooling.

@router-perf
Copy link

router-perf bot commented Apr 5, 2024

CI performance tests

  • reload - Reload test over a long period of time at a constant rate of users
  • events_big_cap_high_rate_callback - Stress test for events with a lot of users, deduplication enabled and high rate event with a big queue capacity using callback mode
  • events_without_dedup_callback - Stress test for events with a lot of users and deduplication DISABLED using callback mode
  • large-request - Stress test with a 1 MB request payload
  • const - Basic stress test that runs with a constant number of users
  • no-graphos - Basic stress test, no GraphOS.
  • step-jemalloc-tuning - Clone of the basic stress test for jemalloc tuning
  • events - Stress test for events with a lot of users and deduplication ENABLED
  • events_callback - Stress test for events with a lot of users and deduplication ENABLED in callback mode
  • events_big_cap_high_rate - Stress test for events with a lot of users, deduplication enabled and high rate event with a big queue capacity
  • events_without_dedup - Stress test for events with a lot of users and deduplication DISABLED
  • xxlarge-request - Stress test with 100 MB request payload
  • xlarge-request - Stress test with 10 MB request payload
  • step - Basic stress test that steps up the number of users over time

@garypen
Copy link
Contributor

garypen commented Apr 10, 2024

We need to do something in this space, but I feel we can do something better when #4796 lands.

I imagine that with the new rust support we could do something like:

(query from client) -> new functionality in 4796 -> (normalised form)

Then use (normalised form) as the key to the query plan.

If the conversion to (normalised form) is expensive we could introduce a smaller cache to hold (query from client) -> (normalised form), but I imagine that shouldn't be required.

Note: Some details (extra items in the key such as metadata) are omitted for brevity, but wouldn't this basic approach work?

@Geal
Copy link
Contributor Author

Geal commented Apr 10, 2024

That's definitely something we could do. And we already have the right cache for that, in query analysis #4796 (comment)

@bonnici
Copy link
Contributor

bonnici commented Apr 11, 2024

Note that right now the normalised form is stripping out some important information like alias names and input objects, but I'm planning to add support for including those things as part of PULSR-695.

@Geal
Copy link
Contributor Author

Geal commented Apr 11, 2024

@bonnici for some context, we recently merged #4883 which uses a hashing scheme for the query that keeps the same hash across schema updates, if the update does not affect the query. Unfortunately, we needed to add back the operation name to the query plan cache key (#4921), because the query planner was returning the usage reporting structure as well, and the operation signature contains the operation name, so with the operation name in the cache key, we would have not reported operations properly.
But now, with the usage reporting and operation signature generation happening on the rust side, we will be able to move that task entirely to the query analysis layer (ase we're doing with validation in #4551), which executes much earlier. And in that layer, we could also have a transformation layer to normalize the operation name, extract hardcoded variables, etc, so that we reduce further the number of queries sent to planning. And the client side would see no difference because the responnse formatting is done according to the original query

@Geal Geal requested a review from a team April 23, 2024 10:08
@abernix abernix marked this pull request as draft May 8, 2024 09:12
@abernix
Copy link
Member

abernix commented May 8, 2024

I've converted this to a draft so we can discuss it further, but it won't be on our review backlog for now.

@Geal
Copy link
Contributor Author

Geal commented May 13, 2024

@abernix it was in the review queue because I expected some reviews :/

@garypen
Copy link
Contributor

garypen commented May 13, 2024

Before looking too much into the implementation of this, do we still think this is the right solution for this problem?

My comment above still seems applicable. More so, since various changes have landed that make "normalised form" a more realistic prospect.

In general, this feels like a "bigger" decision than code review and could use some more substantial design reviewing before proceeding to code review. Perhaps we could discuss this in the next router architectural working group meeting? One thing we definitely want to discuss is whether we should expose configuration of this feature.

@Geal
Copy link
Contributor Author

Geal commented May 13, 2024

normalization is a much larger issue than this, and not that much related. I agree that it needs further discussion though, I added it to the topics of the meeting

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants