
Return owned memory via BytesMut to reduce downstream copying #1182

Open

wants to merge 9 commits into main
Conversation

tbraun96

Compiles and runs. Tests return Ok(())

@tbraun96 tbraun96 changed the title Resolution for #1173 Resolution for issue #1173 Aug 19, 2021
@djc
Collaborator

djc commented Aug 19, 2021

What is the difference between the two commits you have here?

@tbraun96
Author

What is the difference between the two commits you have here?

There was a warning about an unused import, which I cleared up. I also just made a new commit; cargo test passes on my machine.

@tbraun96
Author

Might I also add that in order for this to work, publicly-exposed functions like send_datagram now accept a BytesMut, so this would naturally be a breaking change.

@Ralith
Collaborator

Ralith commented Aug 19, 2021

Thanks for the PR!

publicly-exposed functions like send_datagram now accept a BytesMut

Interfaces that consume data should always take Bytes, because BytesMut can be efficiently converted to Bytes with BytesMut::freeze but the reverse is not true.

@tbraun96
Author

Okay, I adjusted it. The public functions are now back to normal, as desired. I had to make Datagram an enum to differentiate between outgoing and incoming data in order to keep things zero-copy; otherwise there were places in the code with unnecessary copying.

@@ -1651,10 +1651,10 @@ where
}
}
let offset = self.spaces[space].crypto_offset;
let outgoing = Bytes::from(outgoing);
let outgoing = BytesMut::from(&outgoing[..]);
Contributor

I would be extremely careful with this. BytesMut is implemented differently from Bytes and might have vastly different allocation and performance implications. Please rule that out and run benchmarks to check that nothing changes.

I would highly recommend not to touch this code. And I actually doubt it needs to, since the idea was just about making the received bytes available as BytesMut.

Collaborator

@Ralith Ralith Aug 27, 2021

Hm, yeah. Just for starters, this is introducing an alloc + copy, and the clone().freeze() below is introducing another. While that's probably not going to be a bottleneck, it shouldn't be needed.

Author

Good catch. It looks like that change was rendered unnecessary after using the enum to differentiate between the incoming and outgoing types. That code no longer performs an unnecessary copy.

Contributor

Thanks for looking at it again. However, this wasn't just a comment on one particular instance; it affects more places in the CR which are potentially degrading performance.

The CR should have attached benchmark results which really show that it isn't degrading performance for existing use-cases and that it increases performance for the new use-case.

Without this, it looks like there is a hypothetical issue for a rare use-case, a change tries to address it, and without any confirmation that the change actually helps, other parts might get degraded.

Author

How should I test for it? Have one test spamming unreliable UDP packets to a receiving end (localhost -> localhost) using the current master branch, then another test doing the same on my fork?

Contributor

You can look at the benchmark that exists for streams: https://github.com/quinn-rs/quinn/tree/main/bench

Run that before and after your change, and check that your socket buffers are set high enough (> 1 MB) that they are not the bottleneck.

A similar benchmark can be written for datagram transmission.

Author

@tbraun96 tbraun96 Aug 27, 2021

Both versions were benchmarked as release builds using the stock stream benchmark, on a MacBook Pro M1.

Forked Version:

Sent 1073741824 bytes on 1 streams in 2.63s (388.88 MiB/s)

Stream metrics:

      │  Throughput   │ Duration 
──────┼───────────────┼──────────
 AVG  │  388.88 MiB/s │     2.63s
 P0   │  388.75 MiB/s │     2.63s
 P10  │  389.00 MiB/s │     2.63s
 P50  │  389.00 MiB/s │     2.63s
 P90  │  389.00 MiB/s │     2.63s
 P100 │  389.00 MiB/s │     2.63s


Sent 1073741824 bytes on 1 streams in 2.63s (388.66 MiB/s)

Stream metrics:

      │  Throughput   │ Duration 
──────┼───────────────┼──────────
 AVG  │  388.62 MiB/s │     2.64s
 P0   │  388.50 MiB/s │     2.63s
 P10  │  388.75 MiB/s │     2.64s
 P50  │  388.75 MiB/s │     2.64s
 P90  │  388.75 MiB/s │     2.64s
 P100 │  388.75 MiB/s │     2.64s


Sent 1073741824 bytes on 1 streams in 2.65s (386.73 MiB/s)

Stream metrics:

      │  Throughput   │ Duration 
──────┼───────────────┼──────────
 AVG  │  386.62 MiB/s │     2.65s
 P0   │  386.50 MiB/s │     2.65s
 P10  │  386.75 MiB/s │     2.65s
 P50  │  386.75 MiB/s │     2.65s
 P90  │  386.75 MiB/s │     2.65s
 P100 │  386.75 MiB/s │     2.65s

Quinn Master:

Sent 1073741824 bytes on 1 streams in 2.64s (387.17 MiB/s)

Stream metrics:

      │  Throughput   │ Duration 
──────┼───────────────┼──────────
 AVG  │  387.12 MiB/s │     2.65s
 P0   │  387.00 MiB/s │     2.64s
 P10  │  387.25 MiB/s │     2.65s
 P50  │  387.25 MiB/s │     2.65s
 P90  │  387.25 MiB/s │     2.65s
 P100 │  387.25 MiB/s │     2.65s


Sent 1073741824 bytes on 1 streams in 2.64s (387.34 MiB/s)

Stream metrics:

      │  Throughput   │ Duration 
──────┼───────────────┼──────────
 AVG  │  387.38 MiB/s │     2.64s
 P0   │  387.25 MiB/s │     2.64s
 P10  │  387.50 MiB/s │     2.64s
 P50  │  387.50 MiB/s │     2.64s
 P90  │  387.50 MiB/s │     2.64s
 P100 │  387.50 MiB/s │     2.64s


Sent 1073741824 bytes on 1 streams in 2.65s (386.21 MiB/s)

Stream metrics:

      │  Throughput   │ Duration 
──────┼───────────────┼──────────
 AVG  │  386.12 MiB/s │     2.65s
 P0   │  386.00 MiB/s │     2.65s
 P10  │  386.25 MiB/s │     2.65s
 P50  │  386.25 MiB/s │     2.65s
 P90  │  386.25 MiB/s │     2.65s
 P100 │  386.25 MiB/s │     2.65s

The forked version is slightly faster, though there are many reasons why that could be (e.g., the OS). Since the difference is basically statistically insignificant, we shouldn't consider the change harmful.

Collaborator

@djc djc left a comment

Here are some things that stood out to me -- not quite a detailed review yet.

@@ -515,8 +517,7 @@ impl Crypto {
}

pub struct Iter {
// TODO: ditch io::Cursor after bytes 0.5
Collaborator

You removed this TODO without resolving it -- wondering why?

Author

@tbraun96 tbraun96 Sep 8, 2021

Okay, it's been reverted. Originally I tried solving the problem, but the tests failed, so I reverted everything; I just accidentally left that TODO out. How might the TODO be solved? I could include that in this PR if you want.

}

impl Datagram {
pub(crate) fn assert_incoming(self) -> BytesMut {
Collaborator

Let's not panic here. Instead, call the function incoming() and have it return Option<BytesMut>.

…nged assert_incoming to return an Option instead of panicking
let len = self.bytes.get_var()?;
if len > self.bytes.remaining() as u64 {
return Err(UnexpectedEnd);
}
let start = self.bytes.position() as usize;
self.bytes.advance(len as usize);
Ok(self.bytes.get_ref().slice(start..(start + len as usize)))
Ok(BytesMut::from(
Contributor

I think this now creates a deep copy of all incoming data, where there wasn't one before.

Since this includes datagram frames, it should effectively negate the whole purpose of this PR?

My guess is that it didn't show up as more negative in the benchmark because that one uses just a single stream and always has sufficient data to send, so the application runs into the take_remaining case, which stays cheap. But if people try to send smaller datagrams or use small streams it would get more expensive.

Author

Since this includes datagram frames, it should effectively negate the whole purpose of this PR?

True. I wasn't aware slice was a cheap operation; I somehow thought it was a copy.

Going through this all, it seems this PR won't work without significantly changing the buffering mechanism. Do you have any ideas?

Collaborator

It seems like one thing we could do is to buffer BytesMuts throughout the (quinn-proto) stack. That way we can split them wherever necessary, I think, without copying. Would that make sense?

@@ -633,7 +637,7 @@ impl Iter {
Frame::Ack(Ack {
delay,
largest,
additional: self.bytes.get_ref().slice(start..end),
additional: Bytes::copy_from_slice(&self.bytes.get_ref()[start..end] as &[u8]),
Contributor

Looks like another new deep copy?

I don't think it should do this. If it already has to allocate, it should at least immediately decode the ACK blocks and store them in a list, so that this work goes away later on.

@Ralith Ralith changed the title Resolution for issue #1173 Return owned memory via BytesMut to reduce downstream copying Sep 19, 2021
4 participants