Switch default protocol to v3 #152

Pchelolo · 2020-06-30T14:41:01Z

Rational
In order to support newer features of the ratelimit protocol (like rate limit overrides), we’d need the service to use v3 protocol instead of v2 protocol. Thus, we propose to upgrade the ratelimiter core to v3 protocol. Since there doesn’t seem to be any real changes between v2 and v3, this seems to be as easy as replacing the imports and referencing newer versions in the data plane.

However, to support complete compatibility with v2, we’d need to have a legacy layer, just like it’s done for v1 now. Supporting 3 versions of the protocol with 2 legacies seems like an overkill, so we propose to simultaneously drop support for v1 ratelimit.proto. Seems like it should be ok given that the release schedule proposed doing so in Q32018 [2]

Proposal
Thus, the concrete proposal:
Protocol v1 support is dropped
Protocol v2 becomes legacy and converted to v3. Given that the protocols are only different in message names, this should have almost no performance penalty
The service core is updated to use protocol v3

We intend to contribute all the code. We are also planning to use v3 protocol in production in our own fork in the beginning, thus we can provide some battle testing. What do you think?

mattklein123 · 2020-06-30T15:22:57Z

+1, awesome, thanks.

Pchelolo · 2020-07-01T18:30:38Z

Resolved with #11595

Pchelolo · 2020-07-02T14:16:53Z

I have realized a few followups are needed: #155

ysawa0 · 2020-07-10T15:46:48Z

@Pchelolo after testing the latest master with these changes I noticed a big performance degradation with V2 API. Under light load the p999 latency went from 15ms => 22ms which is not ideal as we try to keep p999 under 20ms to avoid as many fail-open rate limit requests.

I did a CPU profiling with pprof and suspect it's due to more time being spent in the converting of v2 request to v3:
https://github.com/envoyproxy/ratelimit/blob/master/src/service/ratelimit_legacy.go#L67-L93

Is there a better way we can convert the request? As v2 is to be officially deprecated at end of this year, moving to v3 is good (and we plan to move our Envoy configs to v3 as well) but I feel there are many orgs still on v2.

Pchelolo · 2020-07-10T16:00:23Z

@ysawa0 thank you so much for testing. I have created #155 as a followup to this, which switches converting v2->v3 to be much more explicit. Originally that was done because one of the fields was renamed, thus serializing/unserializing doesn't work any more, it wasn't covered by tests, thus forgotten. But I feel it could be beneficial for performance as a side effect.

If you don't mind testing that version, given that you have perf tests setup, it would be great. If you don't have time for that, please let me know, I'll set myself up for performance test and see if #155 mitigates this problem.

A performance degradation is expected, since we now need to do the conversion and no matter how the conversion is done, it will cost something, but 7ms increase indeed seems too high of a price to pay. Perhaps if we don't come up with a solution, #153 should be reverted and only reapplied after v2 was deprecated?

ysawa0 · 2020-07-13T16:14:37Z

Thanks @Pchelolo I tried out the PR and that looks better in pprof. Let me do more long-term tests and get back to you with a definitive answer!

Pchelolo · 2020-07-13T16:16:12Z

@ysawa0 thank you!

ysawa0 · 2020-07-15T23:13:51Z

#155 did indeed fix a lot of the issues but after testing ratelimit before the v2/v3 api code was even merged at all, I still see performance degradation. I compared two builds of ratelimit, one before #137 was merged and one after.
I suspect that this Redis code update, while it improves latency for calls to Redis noticeably, it is causing the DoLimit function to spend more cpu time while processing a request. Sorry to pull you in but @caitong93 do you have any insight into this?

image of p999/p99 latency bumping up when I switch versions

caitong93 · 2020-07-18T16:27:22Z

@ysawa0

#137 make redis client do pipelining more efficiently to improve throughput (by buffering as much as possible commands from concurrent requests in a time window), so it is possible to introduce more latency.

Please try to set REDIS_PIPELINE_WINDOW and REDIS_PIPELINE_LIMIT to values smaller than defaults https://github.com/envoyproxy/ratelimit/pull/137/files#diff-04c6e90faac2675aa89e2176d2eec7d8R419

it is causing the DoLimit function to spend more cpu time while processing a request

I will try to investigate this.

ysawa0 · 2020-07-23T00:30:21Z

@caitong93 I tested with

        - name: REDIS_PIPELINE_WINDOW
          value: "0"
        - name: REDIS_PIPELINE_LIMIT
          value: "0"

this would turn the feature completely off, correct?

caitong93 · 2020-07-23T02:39:18Z

this would turn the feature completely off, correct?

Yes, in this case, redis commands will be sent immediately

ysawa0 · 2020-08-03T19:55:44Z

Apologizes for delay in this update. Been trying to find the time to do this test the right way. Here's what I saw after trying 3 builds. For Redis, I let an Envoy sidecar do the routing via the Redis filter and turned off pipelining.

All reported latencies are the p999 under ~2k rps.

With latest (after #158 merge, commit d7d563a) 20ms latency.
After #137 merge (commit afda91c) 22ms latency
With commit 1657473, right before 137, 16.5ms latency.

Not sure if anyone else is having the same experience as us but that's what I recorded. Of course I could be configuring something incorrectly or it's due to some uniqueness with our system. In that case, a second set of eyes would be great! I'll do some profiling to see if there's any low hanging fruit to optimize. I think one thing could be replacing logrus with something more performant like zap.

caitong93 · 2020-08-05T14:17:08Z

Comparing benchmark stats for no pipeline case (by run https://github.com/envoyproxy/ratelimit/blob/master/test/redis/bench_test.go#L68), I do observe cpu performance degradation.

benchstat /tmp/ratelimit-before-1657473a.stat  /tmp/ratelimit-master.stat 
name                           old time/op    new time/op    delta
ParallelDoLimit/no_pipeline-4    12.4µs ± 0%    16.1µs ± 0%   ~     (p=1.000 n=1+1)

name                           old alloc/op   new alloc/op   delta
ParallelDoLimit/no_pipeline-4      624B ± 0%      302B ± 0%   ~     (p=1.000 n=1+1)

name                           old allocs/op  new allocs/op  delta
ParallelDoLimit/no_pipeline-4      29.0 ± 0%      16.0 ± 0%   ~     (p=1.000 n=1+1)

Actually before #137 , the client use explicitly pipelining, this may lead to different main reasons, will try to send a PR to optimize in this week

caitong93 · 2020-08-07T06:21:38Z

@ysawa0 can you try this to see if it can fix the problem #163

open a new issue to track this #164

ysawa0 · 2020-08-07T18:59:52Z

@caitong93 Beautiful. I've run #163 vs pre #137 and they are reporting basically same p999 latency and timeouts! Thank you!

Pchelolo mentioned this issue Jun 30, 2020

Proposal: support for dynamic rate limits envoyproxy/envoy#11595

Closed

mattklein123 added the no stalebot label Jun 30, 2020

Pchelolo mentioned this issue Jul 1, 2020

Drop support for legacy ratelimit.proto and upgrade to v3 rls.proto #153

Merged

Pchelolo closed this as completed Jul 1, 2020

Pchelolo reopened this Jul 2, 2020

caitong93 mentioned this issue Aug 7, 2020

redis client: default to use explicit pipelining #163

Merged

caitong93 mentioned this issue Aug 7, 2020

Cpu performance degradation introduced by #137 #164

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Switch default protocol to v3 #152

Switch default protocol to v3 #152

Pchelolo commented Jun 30, 2020

mattklein123 commented Jun 30, 2020

Pchelolo commented Jul 1, 2020

Pchelolo commented Jul 2, 2020

ysawa0 commented Jul 10, 2020 •

edited

Pchelolo commented Jul 10, 2020 •

edited

ysawa0 commented Jul 13, 2020

Pchelolo commented Jul 13, 2020

ysawa0 commented Jul 15, 2020 •

edited

caitong93 commented Jul 18, 2020

ysawa0 commented Jul 23, 2020

caitong93 commented Jul 23, 2020

ysawa0 commented Aug 3, 2020

caitong93 commented Aug 5, 2020

caitong93 commented Aug 7, 2020 •

edited

ysawa0 commented Aug 7, 2020

Switch default protocol to v3 #152

Switch default protocol to v3 #152

Comments

Pchelolo commented Jun 30, 2020

mattklein123 commented Jun 30, 2020

Pchelolo commented Jul 1, 2020

Pchelolo commented Jul 2, 2020

ysawa0 commented Jul 10, 2020 • edited

Pchelolo commented Jul 10, 2020 • edited

ysawa0 commented Jul 13, 2020

Pchelolo commented Jul 13, 2020

ysawa0 commented Jul 15, 2020 • edited

caitong93 commented Jul 18, 2020

ysawa0 commented Jul 23, 2020

caitong93 commented Jul 23, 2020

ysawa0 commented Aug 3, 2020

caitong93 commented Aug 5, 2020

caitong93 commented Aug 7, 2020 • edited

ysawa0 commented Aug 7, 2020

ysawa0 commented Jul 10, 2020 •

edited

Pchelolo commented Jul 10, 2020 •

edited

ysawa0 commented Jul 15, 2020 •

edited

caitong93 commented Aug 7, 2020 •

edited