Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Switch default protocol to v3 #152

Open
Pchelolo opened this issue Jun 30, 2020 · 15 comments
Open

Switch default protocol to v3 #152

Pchelolo opened this issue Jun 30, 2020 · 15 comments

Comments

@Pchelolo
Copy link
Contributor

Rational
In order to support newer features of the ratelimit protocol (like rate limit overrides), we’d need the service to use v3 protocol instead of v2 protocol. Thus, we propose to upgrade the ratelimiter core to v3 protocol. Since there doesn’t seem to be any real changes between v2 and v3, this seems to be as easy as replacing the imports and referencing newer versions in the data plane.

However, to support complete compatibility with v2, we’d need to have a legacy layer, just like it’s done for v1 now. Supporting 3 versions of the protocol with 2 legacies seems like an overkill, so we propose to simultaneously drop support for v1 ratelimit.proto. Seems like it should be ok given that the release schedule proposed doing so in Q32018 [2]

Proposal
Thus, the concrete proposal:
Protocol v1 support is dropped
Protocol v2 becomes legacy and converted to v3. Given that the protocols are only different in message names, this should have almost no performance penalty
The service core is updated to use protocol v3

We intend to contribute all the code. We are also planning to use v3 protocol in production in our own fork in the beginning, thus we can provide some battle testing. What do you think?

@mattklein123
Copy link
Member

+1, awesome, thanks.

@Pchelolo
Copy link
Contributor Author

Pchelolo commented Jul 1, 2020

Resolved with #11595

@Pchelolo Pchelolo closed this as completed Jul 1, 2020
@Pchelolo
Copy link
Contributor Author

Pchelolo commented Jul 2, 2020

I have realized a few followups are needed: #155

@Pchelolo Pchelolo reopened this Jul 2, 2020
@ysawa0
Copy link
Member

ysawa0 commented Jul 10, 2020

@Pchelolo after testing the latest master with these changes I noticed a big performance degradation with V2 API. Under light load the p999 latency went from 15ms => 22ms which is not ideal as we try to keep p999 under 20ms to avoid as many fail-open rate limit requests.

I did a CPU profiling with pprof and suspect it's due to more time being spent in the converting of v2 request to v3:
https://github.com/envoyproxy/ratelimit/blob/master/src/service/ratelimit_legacy.go#L67-L93

Is there a better way we can convert the request? As v2 is to be officially deprecated at end of this year, moving to v3 is good (and we plan to move our Envoy configs to v3 as well) but I feel there are many orgs still on v2.

@Pchelolo
Copy link
Contributor Author

Pchelolo commented Jul 10, 2020

@ysawa0 thank you so much for testing. I have created #155 as a followup to this, which switches converting v2->v3 to be much more explicit. Originally that was done because one of the fields was renamed, thus serializing/unserializing doesn't work any more, it wasn't covered by tests, thus forgotten. But I feel it could be beneficial for performance as a side effect.

If you don't mind testing that version, given that you have perf tests setup, it would be great. If you don't have time for that, please let me know, I'll set myself up for performance test and see if #155 mitigates this problem.

A performance degradation is expected, since we now need to do the conversion and no matter how the conversion is done, it will cost something, but 7ms increase indeed seems too high of a price to pay. Perhaps if we don't come up with a solution, #153 should be reverted and only reapplied after v2 was deprecated?

@ysawa0
Copy link
Member

ysawa0 commented Jul 13, 2020

Thanks @Pchelolo I tried out the PR and that looks better in pprof. Let me do more long-term tests and get back to you with a definitive answer!

@Pchelolo
Copy link
Contributor Author

@ysawa0 thank you!

@ysawa0
Copy link
Member

ysawa0 commented Jul 15, 2020

#155 did indeed fix a lot of the issues but after testing ratelimit before the v2/v3 api code was even merged at all, I still see performance degradation. I compared two builds of ratelimit, one before #137 was merged and one after.
I suspect that this Redis code update, while it improves latency for calls to Redis noticeably, it is causing the DoLimit function to spend more cpu time while processing a request. Sorry to pull you in but @caitong93 do you have any insight into this?

image of p999/p99 latency bumping up when I switch versions
image

@caitong93
Copy link
Contributor

@ysawa0

#137 make redis client do pipelining more efficiently to improve throughput (by buffering as much as possible commands from concurrent requests in a time window), so it is possible to introduce more latency.

Please try to set REDIS_PIPELINE_WINDOW and REDIS_PIPELINE_LIMIT to values smaller than defaults https://github.com/envoyproxy/ratelimit/pull/137/files#diff-04c6e90faac2675aa89e2176d2eec7d8R419

it is causing the DoLimit function to spend more cpu time while processing a request

I will try to investigate this.

@ysawa0
Copy link
Member

ysawa0 commented Jul 23, 2020

@caitong93 I tested with

        - name: REDIS_PIPELINE_WINDOW
          value: "0"
        - name: REDIS_PIPELINE_LIMIT
          value: "0"

this would turn the feature completely off, correct?

@caitong93
Copy link
Contributor

this would turn the feature completely off, correct?

Yes, in this case, redis commands will be sent immediately

@ysawa0
Copy link
Member

ysawa0 commented Aug 3, 2020

Apologizes for delay in this update. Been trying to find the time to do this test the right way. Here's what I saw after trying 3 builds. For Redis, I let an Envoy sidecar do the routing via the Redis filter and turned off pipelining.

All reported latencies are the p999 under ~2k rps.

With latest (after #158 merge, commit d7d563a) 20ms latency.
After #137 merge (commit afda91c) 22ms latency
With commit 1657473, right before 137, 16.5ms latency.

Not sure if anyone else is having the same experience as us but that's what I recorded. Of course I could be configuring something incorrectly or it's due to some uniqueness with our system. In that case, a second set of eyes would be great! I'll do some profiling to see if there's any low hanging fruit to optimize. I think one thing could be replacing logrus with something more performant like zap.

@caitong93
Copy link
Contributor

Comparing benchmark stats for no pipeline case (by run https://github.com/envoyproxy/ratelimit/blob/master/test/redis/bench_test.go#L68), I do observe cpu performance degradation.

benchstat /tmp/ratelimit-before-1657473a.stat  /tmp/ratelimit-master.stat 
name                           old time/op    new time/op    delta
ParallelDoLimit/no_pipeline-4    12.4µs ± 0%    16.1µs ± 0%   ~     (p=1.000 n=1+1)

name                           old alloc/op   new alloc/op   delta
ParallelDoLimit/no_pipeline-4      624B ± 0%      302B ± 0%   ~     (p=1.000 n=1+1)

name                           old allocs/op  new allocs/op  delta
ParallelDoLimit/no_pipeline-4      29.0 ± 0%      16.0 ± 0%   ~     (p=1.000 n=1+1)

Actually before #137 , the client use explicitly pipelining, this may lead to different main reasons, will try to send a PR to optimize in this week

@caitong93
Copy link
Contributor

caitong93 commented Aug 7, 2020

@ysawa0 can you try this to see if it can fix the problem #163

open a new issue to track this #164

@ysawa0
Copy link
Member

ysawa0 commented Aug 7, 2020

@caitong93 Beautiful. I've run #163 vs pre #137 and they are reporting basically same p999 latency and timeouts! Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants