Return owned memory via BytesMut to reduce downstream copying #1182
base: main
Conversation
What is the difference between the two commits you have here?
There was a warning about an unused import that I cleared. Also, I just made a new commit.
Might I also add that in order for this to work, publicly-exposed functions like …
Thanks for the PR!
Interfaces that consume data should always take …
…to minimize unnecessary copying
Okay, I adjusted it. Public functions are now normal, as desired. Had to make …
quinn-proto/src/connection/mod.rs
Outdated
```diff
@@ -1651,10 +1651,10 @@ where
         }
     }
     let offset = self.spaces[space].crypto_offset;
-    let outgoing = Bytes::from(outgoing);
+    let outgoing = BytesMut::from(&outgoing[..]);
```
I would be extremely careful with this. `BytesMut` is implemented differently than `Bytes`, and might have vastly different allocation and performance implications. Please rule that out and run benchmarks to check that nothing changes.
I would highly recommend not touching this code. And I actually doubt it needs to change, since the idea was just about making the received bytes available as `BytesMut`.
Hm, yeah. Just for starters, this is introducing an alloc + copy, and the `clone().freeze()` below is introducing another. While that's probably not going to be a bottleneck, it shouldn't be needed.
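The cost asymmetry being flagged here can be sketched with std types alone. This is a toy illustration, not the `bytes` crate's actual implementation: `Arc<Vec<u8>>` stands in for the shared-ownership model of `Bytes`, and `to_vec()` for the alloc-plus-copy that `BytesMut::from(&outgoing[..])` performs.

```rust
use std::sync::Arc;

fn main() {
    let payload: Vec<u8> = vec![0u8; 1024];

    // Bytes-style clone: bumps a refcount, the payload is not reallocated.
    let shared: Arc<Vec<u8>> = Arc::new(payload);
    let cheap = Arc::clone(&shared);
    assert!(Arc::ptr_eq(&shared, &cheap)); // same backing buffer

    // BytesMut::from(&slice)-style conversion: allocates and copies.
    let copied: Vec<u8> = shared.as_slice().to_vec();
    assert_ne!(shared.as_ptr(), copied.as_ptr()); // distinct allocation
}
```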
Good catch. Looks like that change was rendered unnecessary after using the enum to differentiate between the incoming/outgoing types. That code no longer performs an unnecessary copy.
Thanks for looking at it again. However, this wasn't just a comment on one particular instance; it affects more places in the CR which are potentially degrading performance.
The CR should have benchmark results attached which show that it doesn't degrade performance for existing use-cases and does increase performance for the new use-case.
Without this, it looks like there is a hypothetical issue for a rare use-case, a change that tries to address it, and, without any confirmation that the change actually helps, other parts might get degraded.
How should I test for it? Have one test spamming unreliable UDP datagrams at a receiving end (localhost -> localhost) using the current master branch, then another test doing the same on my fork?
You can look at the benchmark that exists for streams: https://github.com/quinn-rs/quinn/tree/main/bench
Run that before and after your change, and check that your socket buffers are set high enough (> 1 MB) that they are not the bottleneck.
A similar benchmark can be written for datagram transmission.
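For reference, one possible invocation might look like this. The sysctl keys are the Linux UDP buffer caps; on macOS the equivalents live under `net.inet.udp`. The `bulk` binary name in quinn's `bench` crate is an assumption here; check the bench directory's README for the exact command.

```shell
# Raise the socket buffer caps so they aren't the bottleneck (Linux).
sudo sysctl -w net.core.rmem_max=2500000
sudo sysctl -w net.core.wmem_max=2500000

# Run the stream benchmark in release mode, from the repository root.
cargo run --release -p bench --bin bulk
```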
Both versions were benchmarked as release builds, running the vanilla benchmark on a MacBook Pro M1.
Forked Version:
Sent 1073741824 bytes on 1 streams in 2.63s (388.88 MiB/s)
Stream metrics:
│ Throughput │ Duration
──────┼───────────────┼──────────
AVG │ 388.88 MiB/s │ 2.63s
P0 │ 388.75 MiB/s │ 2.63s
P10 │ 389.00 MiB/s │ 2.63s
P50 │ 389.00 MiB/s │ 2.63s
P90 │ 389.00 MiB/s │ 2.63s
P100 │ 389.00 MiB/s │ 2.63s
Sent 1073741824 bytes on 1 streams in 2.63s (388.66 MiB/s)
Stream metrics:
│ Throughput │ Duration
──────┼───────────────┼──────────
AVG │ 388.62 MiB/s │ 2.64s
P0 │ 388.50 MiB/s │ 2.63s
P10 │ 388.75 MiB/s │ 2.64s
P50 │ 388.75 MiB/s │ 2.64s
P90 │ 388.75 MiB/s │ 2.64s
P100 │ 388.75 MiB/s │ 2.64s
Sent 1073741824 bytes on 1 streams in 2.65s (386.73 MiB/s)
Stream metrics:
│ Throughput │ Duration
──────┼───────────────┼──────────
AVG │ 386.62 MiB/s │ 2.65s
P0 │ 386.50 MiB/s │ 2.65s
P10 │ 386.75 MiB/s │ 2.65s
P50 │ 386.75 MiB/s │ 2.65s
P90 │ 386.75 MiB/s │ 2.65s
P100 │ 386.75 MiB/s │ 2.65s
Quinn Master:
Sent 1073741824 bytes on 1 streams in 2.64s (387.17 MiB/s)
Stream metrics:
│ Throughput │ Duration
──────┼───────────────┼──────────
AVG │ 387.12 MiB/s │ 2.65s
P0 │ 387.00 MiB/s │ 2.64s
P10 │ 387.25 MiB/s │ 2.65s
P50 │ 387.25 MiB/s │ 2.65s
P90 │ 387.25 MiB/s │ 2.65s
P100 │ 387.25 MiB/s │ 2.65s
Sent 1073741824 bytes on 1 streams in 2.64s (387.34 MiB/s)
Stream metrics:
│ Throughput │ Duration
──────┼───────────────┼──────────
AVG │ 387.38 MiB/s │ 2.64s
P0 │ 387.25 MiB/s │ 2.64s
P10 │ 387.50 MiB/s │ 2.64s
P50 │ 387.50 MiB/s │ 2.64s
P90 │ 387.50 MiB/s │ 2.64s
P100 │ 387.50 MiB/s │ 2.64s
Sent 1073741824 bytes on 1 streams in 2.65s (386.21 MiB/s)
Stream metrics:
│ Throughput │ Duration
──────┼───────────────┼──────────
AVG │ 386.12 MiB/s │ 2.65s
P0 │ 386.00 MiB/s │ 2.65s
P10 │ 386.25 MiB/s │ 2.65s
P50 │ 386.25 MiB/s │ 2.65s
P90 │ 386.25 MiB/s │ 2.65s
P100 │ 386.25 MiB/s │ 2.65s
The forked version is slightly faster, though there are many reasons why that could be (e.g., OS scheduling). Since the difference is basically statistically insignificant, we shouldn't consider the change harmful.
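As a rough sanity check on that claim, the gap between the mean AVG throughputs of the two builds (figures taken from the three runs of each above) is well under one percent:

```rust
fn main() {
    // AVG throughputs (MiB/s) from the three runs of each build above.
    let fork = [388.88_f64, 388.62, 386.62];
    let master = [387.12_f64, 387.38, 386.12];

    let mean = |xs: &[f64]| xs.iter().sum::<f64>() / xs.len() as f64;
    let (f, m) = (mean(&fork), mean(&master));

    // Relative difference in percent.
    let diff_pct = (f - m) / m * 100.0;
    println!("fork {f:.2} vs master {m:.2} MiB/s: {diff_pct:+.2}%");
    assert!(diff_pct.abs() < 1.0); // within run-to-run noise
}
```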
Here are some things that stood out to me -- not quite a detailed review yet.
```diff
@@ -515,8 +517,7 @@ impl Crypto {
     }
 
 pub struct Iter {
-    // TODO: ditch io::Cursor after bytes 0.5
```
You removed this TODO without resolving it -- wondering why?
Okay, it's been reverted. Originally I tried solving the problem, but the tests failed, so I reverted everything -- except I accidentally left that TODO removed. How might the TODO be solved? I could include that in this PR if you want.
quinn-proto/src/frame.rs
Outdated
```rust
}

impl Datagram {
    pub(crate) fn assert_incoming(self) -> BytesMut {
```
Let's not panic here. Instead, call the function `incoming()` and have it return `Option<BytesMut>`.
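A minimal sketch of the suggested shape, using the names from the review. `Vec<u8>` stands in for `BytesMut` so the example stays std-only; the real frame type's variants are assumptions here.

```rust
// Hypothetical stand-in for the frame type: a datagram is either
// incoming (owned, mutable) or outgoing (already frozen for sending).
#[derive(Debug)]
enum Datagram {
    Incoming(Vec<u8>),
    Outgoing(Vec<u8>),
}

impl Datagram {
    /// Returns the owned buffer for incoming datagrams and `None` otherwise,
    /// instead of panicking on the outgoing variant.
    fn incoming(self) -> Option<Vec<u8>> {
        match self {
            Datagram::Incoming(buf) => Some(buf),
            Datagram::Outgoing(_) => None,
        }
    }
}

fn main() {
    assert_eq!(Datagram::Incoming(vec![1, 2]).incoming(), Some(vec![1, 2]));
    assert_eq!(Datagram::Outgoing(vec![3]).incoming(), None);
}
```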
Changed assert_incoming to return an Option instead of panicking
```diff
     let len = self.bytes.get_var()?;
     if len > self.bytes.remaining() as u64 {
         return Err(UnexpectedEnd);
     }
     let start = self.bytes.position() as usize;
     self.bytes.advance(len as usize);
-    Ok(self.bytes.get_ref().slice(start..(start + len as usize)))
+    Ok(BytesMut::from(
```
I think this now creates a deep copy of all incoming data, where there hasn't been one before.
Since this includes datagram frames, it should effectively negate the whole purpose of this PR?
My guess is that it didn't show up as more negative in the benchmark since that one uses just a single stream and always has sufficient data to send, so the application runs into the `take_remaining` case, which stays cheap. But if people try to send smaller datagrams or use small streams, it would get more expensive.
> Since this includes datagram frames, it should effectively negate the whole purpose of this PR?

True. I wasn't aware `slice` was a cheap operation; I somehow thought it was a copy.
Going through this all, it seems this PR won't work without significantly changing the buffering mechanism. Have any ideas?
It seems like one thing we could do is to buffer `BytesMut`s throughout the (quinn-proto) stack. That way we can split them wherever necessary, I think, without copying. Would that make sense?
```diff
@@ -633,7 +637,7 @@ impl Iter {
     Frame::Ack(Ack {
         delay,
         largest,
-        additional: self.bytes.get_ref().slice(start..end),
+        additional: Bytes::copy_from_slice(&self.bytes.get_ref()[start..end] as &[u8]),
```
Looks like another new deep copy?
I don't think it should do this. If it already has to allocate, it should at least immediately decode the ACK blocks and store those in a list, so that this work goes away later on.
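The eager decoding suggested here could look roughly like this. It is a simplified sketch of RFC 9000's gap encoding, not quinn's actual code; the function name is hypothetical and the inputs are assumed to be already varint-decoded.

```rust
/// Eagerly decode ACK ranges into inclusive (smallest, largest) pairs.
/// Inputs: the largest acknowledged packet number, the first range
/// length, and the subsequent (gap, range) pairs from the frame.
fn decode_ack_ranges(largest: u64, first_range: u64, pairs: &[(u64, u64)]) -> Vec<(u64, u64)> {
    let mut ranges = Vec::with_capacity(1 + pairs.len());
    let mut smallest = largest - first_range;
    ranges.push((smallest, largest));
    for &(gap, range) in pairs {
        // Per RFC 9000 §19.3.1: next largest = previous smallest - gap - 2.
        let hi = smallest - gap - 2;
        let lo = hi - range;
        ranges.push((lo, hi));
        smallest = lo;
    }
    ranges
}

fn main() {
    // Largest acked 10, first range covers 8..=10; one more range covers 3..=5.
    assert_eq!(decode_ack_ranges(10, 2, &[(1, 2)]), vec![(8, 10), (3, 5)]);
}
```

Storing ranges like these instead of the raw `Bytes` slice means the allocation pays for itself: later ACK processing can walk the list directly.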
Compiles and runs. Tests return Ok(())