Return owned memory via BytesMut to reduce downstream copying #1182
base: main
Conversation
What is the difference between the two commits you have here?
There was a warning about an unused import that I cleared. Also, I just made a new commit.
Might I also add that in order for this to work, publicly-exposed functions like …
Thanks for the PR!
Interfaces that consume data should always take …
…to minimize unnecessary copying
Okay, I adjusted it. Public functions are now normal, as desired. Had to make …
quinn-proto/src/connection/mod.rs
Outdated
```diff
@@ -1651,10 +1651,10 @@ where
         }
     }
     let offset = self.spaces[space].crypto_offset;
-    let outgoing = Bytes::from(outgoing);
+    let outgoing = BytesMut::from(&outgoing[..]);
```
I would be extremely careful with this. `BytesMut` is implemented differently than `Bytes`, and might have vastly different allocation and performance implications. Please rule that out and run benchmarks to check that nothing changes.
I would highly recommend not touching this code. And I actually doubt it needs to change, since the idea was just about making the received bytes available as `BytesMut`.
Hm, yeah. Just for starters, this is introducing an alloc + copy, and the `clone().freeze()` below is introducing another. While that's probably not going to be a bottleneck, it shouldn't be needed.
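The cost asymmetry being flagged here can be sketched with std types alone. This is a toy illustration, not the `bytes` crate's actual implementation: `Arc<Vec<u8>>` stands in for the shared-ownership model of `Bytes`, and `to_vec()` for the alloc-plus-copy that `BytesMut::from(&outgoing[..])` performs.

```rust
use std::sync::Arc;

fn main() {
    let payload: Vec<u8> = vec![0u8; 1024];

    // Bytes-style clone: bumps a refcount, the payload is not reallocated.
    let shared: Arc<Vec<u8>> = Arc::new(payload);
    let cheap = Arc::clone(&shared);
    assert!(Arc::ptr_eq(&shared, &cheap)); // same backing buffer

    // BytesMut::from(&slice)-style conversion: allocates and copies.
    let copied: Vec<u8> = shared.as_slice().to_vec();
    assert_ne!(shared.as_ptr(), copied.as_ptr()); // distinct allocation
}
```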
Good catch. Looks like that change was rendered unnecessary after using the enum to differentiate between the incoming/outgoing types. That code no longer performs an unnecessary copy.
Thanks for looking at it again. However, this wasn't just a comment on one particular instance; it affects more places in the CR which are potentially degrading performance.
The CR should have benchmark results attached which show that it doesn't degrade performance for existing use-cases and does increase performance for the new use-case.
Without this, it looks like there is a hypothetical issue for a rare use-case, a change that tries to address it, and, without any confirmation that the change actually helps, other parts might get degraded.
How should I test for it? Have one test spamming unreliable UDP datagrams at a receiving end (localhost -> localhost) using the current master branch, then another test doing the same on my fork?
You can look at the benchmark that exists for streams: https://github.com/quinn-rs/quinn/tree/main/bench
Run that before and after your change, and check that your socket buffers are set high enough (> 1 MB) that they are not the bottleneck.
A similar benchmark can be written for datagram transmission.
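For reference, one possible invocation might look like this. The sysctl keys are the Linux UDP buffer caps; on macOS the equivalents live under `net.inet.udp`. The `bulk` binary name in quinn's `bench` crate is an assumption here; check the bench directory's README for the exact command.

```shell
# Raise the socket buffer caps so they aren't the bottleneck (Linux).
sudo sysctl -w net.core.rmem_max=2500000
sudo sysctl -w net.core.wmem_max=2500000

# Run the stream benchmark in release mode, from the repository root.
cargo run --release -p bench --bin bulk
```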
Both versions were benchmarked as release builds, running the vanilla benchmark on a MacBook Pro M1.
Forked Version:
Sent 1073741824 bytes on 1 streams in 2.63s (388.88 MiB/s)
Stream metrics:
│ Throughput │ Duration
──────┼───────────────┼──────────
AVG │ 388.88 MiB/s │ 2.63s
P0 │ 388.75 MiB/s │ 2.63s
P10 │ 389.00 MiB/s │ 2.63s
P50 │ 389.00 MiB/s │ 2.63s
P90 │ 389.00 MiB/s │ 2.63s
P100 │ 389.00 MiB/s │ 2.63s
Sent 1073741824 bytes on 1 streams in 2.63s (388.66 MiB/s)
Stream metrics:
│ Throughput │ Duration
──────┼───────────────┼──────────
AVG │ 388.62 MiB/s │ 2.64s
P0 │ 388.50 MiB/s │ 2.63s
P10 │ 388.75 MiB/s │ 2.64s
P50 │ 388.75 MiB/s │ 2.64s
P90 │ 388.75 MiB/s │ 2.64s
P100 │ 388.75 MiB/s │ 2.64s
Sent 1073741824 bytes on 1 streams in 2.65s (386.73 MiB/s)
Stream metrics:
│ Throughput │ Duration
──────┼───────────────┼──────────
AVG │ 386.62 MiB/s │ 2.65s
P0 │ 386.50 MiB/s │ 2.65s
P10 │ 386.75 MiB/s │ 2.65s
P50 │ 386.75 MiB/s │ 2.65s
P90 │ 386.75 MiB/s │ 2.65s
P100 │ 386.75 MiB/s │ 2.65s
Quinn Master:
Sent 1073741824 bytes on 1 streams in 2.64s (387.17 MiB/s)
Stream metrics:
│ Throughput │ Duration
──────┼───────────────┼──────────
AVG │ 387.12 MiB/s │ 2.65s
P0 │ 387.00 MiB/s │ 2.64s
P10 │ 387.25 MiB/s │ 2.65s
P50 │ 387.25 MiB/s │ 2.65s
P90 │ 387.25 MiB/s │ 2.65s
P100 │ 387.25 MiB/s │ 2.65s
Sent 1073741824 bytes on 1 streams in 2.64s (387.34 MiB/s)
Stream metrics:
│ Throughput │ Duration
──────┼───────────────┼──────────
AVG │ 387.38 MiB/s │ 2.64s
P0 │ 387.25 MiB/s │ 2.64s
P10 │ 387.50 MiB/s │ 2.64s
P50 │ 387.50 MiB/s │ 2.64s
P90 │ 387.50 MiB/s │ 2.64s
P100 │ 387.50 MiB/s │ 2.64s
Sent 1073741824 bytes on 1 streams in 2.65s (386.21 MiB/s)
Stream metrics:
│ Throughput │ Duration
──────┼───────────────┼──────────
AVG │ 386.12 MiB/s │ 2.65s
P0 │ 386.00 MiB/s │ 2.65s
P10 │ 386.25 MiB/s │ 2.65s
P50 │ 386.25 MiB/s │ 2.65s
P90 │ 386.25 MiB/s │ 2.65s
P100 │ 386.25 MiB/s │ 2.65s
The forked version is slightly faster, though there are many reasons why that could be (e.g., OS scheduling). Since the difference is basically statistically insignificant, we shouldn't consider the change harmful.
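As a rough sanity check on that claim, the gap between the mean AVG throughputs of the two builds (figures taken from the three runs of each above) is well under one percent:

```rust
fn main() {
    // AVG throughputs (MiB/s) from the three runs of each build above.
    let fork = [388.88_f64, 388.62, 386.62];
    let master = [387.12_f64, 387.38, 386.12];

    let mean = |xs: &[f64]| xs.iter().sum::<f64>() / xs.len() as f64;
    let (f, m) = (mean(&fork), mean(&master));

    // Relative difference in percent.
    let diff_pct = (f - m) / m * 100.0;
    println!("fork {f:.2} vs master {m:.2} MiB/s: {diff_pct:+.2}%");
    assert!(diff_pct.abs() < 1.0); // within run-to-run noise
}
```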
Here are some things that stood out to me -- not quite a detailed review yet.
```diff
@@ -515,8 +517,7 @@ impl Crypto {
     }
 
 pub struct Iter {
-    // TODO: ditch io::Cursor after bytes 0.5
```
You removed this TODO without resolving it -- wondering why?
Okay, it's been reverted. Originally I tried solving the problem, but the tests failed, so I reverted everything -- except I accidentally left that TODO removed. How might the TODO be solved? I could include that in this PR if you want.
quinn-proto/src/frame.rs
Outdated
```rust
}

impl Datagram {
    pub(crate) fn assert_incoming(self) -> BytesMut {
```
Let's not panic here. Instead, call the function `incoming()` and have it return `Option<BytesMut>`.
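A minimal sketch of the suggested shape, using the names from the review. `Vec<u8>` stands in for `BytesMut` so the example stays std-only; the real frame type's variants are assumptions here.

```rust
// Hypothetical stand-in for the frame type: a datagram is either
// incoming (owned, mutable) or outgoing (already frozen for sending).
#[derive(Debug)]
enum Datagram {
    Incoming(Vec<u8>),
    Outgoing(Vec<u8>),
}

impl Datagram {
    /// Returns the owned buffer for incoming datagrams and `None` otherwise,
    /// instead of panicking on the outgoing variant.
    fn incoming(self) -> Option<Vec<u8>> {
        match self {
            Datagram::Incoming(buf) => Some(buf),
            Datagram::Outgoing(_) => None,
        }
    }
}

fn main() {
    assert_eq!(Datagram::Incoming(vec![1, 2]).incoming(), Some(vec![1, 2]));
    assert_eq!(Datagram::Outgoing(vec![3]).incoming(), None);
}
```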
Changed assert_incoming to return an Option instead of panicking
```diff
     let len = self.bytes.get_var()?;
     if len > self.bytes.remaining() as u64 {
         return Err(UnexpectedEnd);
     }
     let start = self.bytes.position() as usize;
     self.bytes.advance(len as usize);
-    Ok(self.bytes.get_ref().slice(start..(start + len as usize)))
+    Ok(BytesMut::from(
```
I think this now creates a deep copy of all incoming data, where there hasn't been one before.
Since this includes datagram frames, it should effectively negate the whole purpose of this PR?
My guess is that it didn't show up as more negative in the benchmark since that one uses just a single stream and always has sufficient data to send, so the application runs into the `take_remaining` case, which stays cheap. But if people try to send smaller datagrams or use small streams, it would get more expensive.
> Since this includes datagram frames, it should effectively negate the whole purpose of this PR?

True. I wasn't aware `slice` was a cheap operation; I somehow thought it was a copy.
Going through this all, it seems this PR won't work without significantly changing the buffering mechanism. Have any ideas?
It seems like one thing we could do is to buffer `BytesMut`s throughout the (quinn-proto) stack. That way we can split them wherever necessary, I think, without copying. Would that make sense?
```diff
@@ -633,7 +637,7 @@ impl Iter {
     Frame::Ack(Ack {
         delay,
         largest,
-        additional: self.bytes.get_ref().slice(start..end),
+        additional: Bytes::copy_from_slice(&self.bytes.get_ref()[start..end] as &[u8]),
```
Looks like another new deep copy?
I don't think it should do this. If it already has to allocate, it should at least immediately decode the ACK blocks and store those in a list, so that this work goes away later on.
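The eager decoding suggested here could look roughly like this. It is a simplified sketch of RFC 9000's gap encoding, not quinn's actual code; the function name is hypothetical and the inputs are assumed to be already varint-decoded.

```rust
/// Eagerly decode ACK ranges into inclusive (smallest, largest) pairs.
/// Inputs: the largest acknowledged packet number, the first range
/// length, and the subsequent (gap, range) pairs from the frame.
fn decode_ack_ranges(largest: u64, first_range: u64, pairs: &[(u64, u64)]) -> Vec<(u64, u64)> {
    let mut ranges = Vec::with_capacity(1 + pairs.len());
    let mut smallest = largest - first_range;
    ranges.push((smallest, largest));
    for &(gap, range) in pairs {
        // Per RFC 9000 §19.3.1: next largest = previous smallest - gap - 2.
        let hi = smallest - gap - 2;
        let lo = hi - range;
        ranges.push((lo, hi));
        smallest = lo;
    }
    ranges
}

fn main() {
    // Largest acked 10, first range covers 8..=10; one more range covers 3..=5.
    assert_eq!(decode_ack_ranges(10, 2, &[(1, 2)]), vec![(8, 10), (3, 5)]);
}
```

Storing ranges like these instead of the raw `Bytes` slice means the allocation pays for itself: later ACK processing can walk the list directly.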
Compiles and runs. Tests return Ok(())