1-5 Mbit/s upload throughput in container but GCP H100 host gives >2000 Mbit/s #10344
Comments
Thanks for the report. Would it be possible to get equivalent pcaps for runc/runsc on the A100 where you don't see the issue?
Also, just to be sure, could you confirm whether the A100/H100 are running in the same region?
Yeh, this is weird. I just did. I'll capture from an A100 and check whether the same weirdness is present.
Oh, I see the issue: you can target a specific interface. Also, the Wireshark logs can be filtered down to just the relevant traffic.
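For reference, a capture scoped to a single interface and then narrowed with a display filter might look roughly like this; the interface name, port, and file name are placeholders, not the exact commands used in this thread:

```sh
# Capture only on the device-facing interface; interface name and port are assumptions.
sudo tcpdump -i eth0 -w output.pcap 'tcp port 5201'

# Read the capture back and keep only one TCP stream.
tshark -r output.pcap -Y 'tcp.stream == 0'
```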
Thanks for the extra logs; we're still investigating on our end. Could you send what you get from running
Throughput could be lowered by entering fast recovery unnecessarily. When a larger-than-MTU segment was retransmitted as multiple segments, loss detection could fire either because we hit the dupack threshold or RACK detected loss due to ACKs from retransmissions. RACK was more susceptible to this because it's better at detecting loss and can do so even without 3 dupacks. Thus it fell into this trap more often. Addresses #10344. PiperOrigin-RevId: 631967721
😅 jeez, it's rough out there.

H100 (capture attached)

A100 (capture attached)
We've supported PMTUD for a long time and just never turned it on. Addresses #10344. PiperOrigin-RevId: 632186215
I have a theory as to what's going on, although there are a couple of open questions WRT the packet captures.

Looking at the A100 (fast) capture, we see a healthy connection. Packets are sent that appear larger than the MTU, but that's because we're using GSO to defer segmenting to the last minute.

The H100 (slow) logs seem to retransmit all larger-than-MTU packets, indicating that we're not GSOing correctly. That makes sense: we're supplying 1500 as the MTU. This is what 94c1024 addresses: we should be using the MTU of the device interface, not the container's. Our loss detection (TCP RACK) appears to resend non-GSO'd segments, so when we send a too-large packet it can be seen getting retransmitted in smaller chunks. You can see an example of this right at the start of stream 20.

The two actually confusing things are: (1) why this is different on the two machines. They both run 1500 byte MTU containers with a 1460 byte MTU NIC. I'll try to figure it out, but for now ¯\_(ツ)_/¯. And (2) why we don't see an ICMP fragmentation needed packet in the logs.

Anyways, I think 94c1024 will fix this. @manninglucas: should we put logic similar to that in the default runsc boot process, where it uses the MTU of the default device iff there's an obvious default? #10419 should also help cover more cases where the PMTU causes problems.
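As an illustration of how to spot this pattern in the captures, display filters along these lines would show retransmissions and oversized GSO "super-packets" within a single stream; the stream number comes from the comment above, while the capture file name is an assumption:

```sh
# Retransmitted segments in stream 20 of the slow (H100) capture.
tshark -r output.pcap -Y 'tcp.stream == 20 && tcp.analysis.retransmission'

# Frames larger than a full 1500-byte-MTU Ethernet frame (1500 + 14-byte header),
# i.e. packets handed to the capture point before being segmented.
tshark -r output.pcap -Y 'tcp.stream == 20 && frame.len > 1514'
```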
@thundergolfer In addition to testing at head with 94c1024, do you know whether you do anything to change the MTU inside runsc or the network namespace in which it runs?
Also noted that H100 instances appear to always use gVNIC, while A100 can use gVNIC or virtio, defaulting to the latter I believe. Maybe that's part of the issue, but I'm not seeing problems when trying to repro on a gVNIC machine.
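One generic way to check which of the two a given instance is using is to ask the kernel which driver backs the NIC; the interface name here is an assumption:

```sh
# "gve" indicates gVNIC; "virtio_net" indicates virtio.
ethtool -i eth0 | grep '^driver'

# The detailed link view also shows the MTU the host has configured.
ip -d link show eth0
```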
Thanks for the detailed comments @kevinGC! I think they mostly make sense to me, but I'll work through the details more carefully tomorrow while also testing out 94c1024.

To answer your follow-up question: in our CNI bridge plugin configuration we set the MTU to 1460 on GCP workers, because our VPC has that set as its MTU. We first observed networking problems when running A3 instances on GCP; we'd had no trouble before. This is what we observed:
This very well could be why we didn't have an issue until using A3 instances.
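For context on the CNI MTU setting mentioned above, a bridge-plugin configuration that pins the pod-side MTU to 1460 might look roughly like this; the file path, network name, and addressing are assumptions, not the reporter's actual config:

```sh
cat <<'EOF' > /etc/cni/net.d/10-bridge.conf
{
  "cniVersion": "0.4.0",
  "name": "gcp-worker-net",
  "type": "bridge",
  "bridge": "cni0",
  "mtu": 1460,
  "isGateway": true,
  "ipMasq": true,
  "ipam": {
    "type": "host-local",
    "subnet": "10.22.0.0/16"
  }
}
EOF
```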
Testing result:

Host view of network device (on a different but equivalent machine):

Status quo

Host view of network device (on a different but equivalent machine):

Unexpected results. It seems odd to me that the host shows the MTU is still 1500, even though with 94c1024 the container is now picking up the device's MTU.
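A quick way to compare the two views is to read the MTU from both sides; the device and container names here are placeholders:

```sh
# MTU as reported by the host for the device-facing interface.
ip link show eth0

# MTU as seen from inside the running container (requires iproute2 in the image).
docker exec <container> ip link show eth0
```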
As a workaround: you can pass

We have some ideas regarding the root cause -- the H100's NIC driver may be in some way different -- that we'll keep looking into for now.
We've supported PMTUD for a long time and just never turned it on. Addresses #10344. PiperOrigin-RevId: 634003508
There's still the same 0.00 Mbit/s download with that flag. We have used
Throughput could be lowered by entering fast recovery unnecessarily. When a larger-than-MTU segment was retransmitted as multiple segments, loss detection could fire either because we hit the dupack threshold or RACK detected loss due to ACKs from retransmissions. RACK was more susceptible to this because it's better at detecting loss and can do so even without 3 dupacks. Thus it fell into this trap more often. Addresses #10344. PiperOrigin-RevId: 634071568
Glad that flag works. Not sure where that awful download stat comes from; I don't see it when I try to replicate at any commit. Will keep looking, especially if you're still seeing it after these patches.
I spent a few hours trying to figure out what could be wrong with our GSO packets. At some point, I started thinking that we were looking for a black cat in a dark room.

So I decided to test this hypothesis by running a Kata container and checking whether the issue is reproducible in that environment. A Kata container is a virtual machine with a virtio network device. It injects GSO packets from the guest into the host Linux kernel in a similar way to gVisor, but it uses a different kernel API to do that. Inside a Kata VM, the Linux kernel is running, so it is completely unrelated to the gVisor netstack. It was not a surprise when I found that the same issue is triggered in Kata containers:
In summary, I'm inclined to believe that this issue isn't tied to gVisor. More likely, it resides either within the Linux kernel itself or within the gVNIC device or its driver.
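A sketch of how such a side-by-side comparison could be run, assuming Kata Containers is registered as a Docker runtime; the runtime name, image, and server address are all assumptions:

```sh
# Same upload test under Kata (Linux guest kernel) and under runsc (gVisor netstack);
# if both are slow, the problem sits below the guest network stack.
docker run --rm --runtime=kata-runtime networkstatic/iperf3 -c <iperf3-server> -t 10
docker run --rm --runtime=runsc        networkstatic/iperf3 -c <iperf3-server> -t 10
```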
To those interested, @avagin found the cause of this bug. It's a small issue with the GVE network driver that's used on some GCP hardware. The driver code can be found here. This code in the driver drops a packet if the GSO type isn't exactly equal to the expected value.

We will try to expedite a fix in the GVE driver the best we can from our end. Filing a formal support ticket with GCP may help move the process along as well.

Closing this issue now as it is not a bug in gVisor.
Nice one!
Description
Within gVisor runsc we're seeing extremely low upload performance on GCP H100 instances specifically. We don't have these issues on GCP A100 instances.
I have attached pcap data below in place of runsc debug logs. Let me know of any other info I should gather 🙂.

runsc

host

runc
Steps to reproduce
I unfortunately don't have much of a chance of getting a devbox with an H100 on it. But we're just doing this:
We see similar upload performance problems when uploading to Cloudflare R2.
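The exact commands used aren't preserved in this report; for a rough idea, an upload-throughput check against object storage could look something like this, where the test file size and the presigned upload URL are placeholders:

```sh
# Generate a 100 MiB test file and time an upload; <presigned-upload-url> stands in
# for a real R2/S3 presigned PUT URL.
dd if=/dev/zero of=/tmp/blob bs=1M count=100
curl -sS -o /dev/null -w 'upload speed: %{speed_upload} bytes/s\n' \
    -T /tmp/blob '<presigned-upload-url>'
```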
runsc version
docker version (if using docker)
No response
uname
Linux gcp-h100-us-east4-a-0-c965c22f-6d1f-416d-b245-395141187d95 5.15.0-205.149.5.1.el9uek.x86_64 #2 SMP Fri Apr 5 11:29:36 PDT 2024 x86_64 x86_64 x86_64 GNU/Linux
kubectl (if using Kubernetes)
No response
repo state (if built from source)
No response
runsc debug logs (if available)
output.pcap.zip (gvisor)
output_runc.pcap.zip