
1-5 Mbit/s upload throughput in container but GCP H100 host gives >2000 Mbit/s #10344

Closed

thundergolfer opened this issue May 1, 2024 · 19 comments
Labels: area: networking (issue related to networking), type: bug (something isn't working)


thundergolfer commented May 1, 2024

Description

Within gVisor runsc we're seeing extremely low upload performance on GCP H100 instances specifically. We don't have these issues on GCP A100 instances.

I have attached pcap data below in place of runsc debug logs. Let me know of any other info I should gather 🙂.

runsc

Retrieving speedtest.net configuration...
Testing from Google Cloud (35.221.7.106)...
Retrieving speedtest.net server list...
Selecting best server based on ping...
Hosted by Cox - Nova (Fairfax, VA) [23.95 km]: 2.637 ms
Skipping download test
Testing upload speed......................................................................................................
Upload: 1.58 Mbit/s

host

[modal@gcp-h100-us-east4-a-0-c965c22f-6d1f-416d-b245-395141187d95 ~]$ ./speedtest-cli
Retrieving speedtest.net configuration...
Testing from Google Cloud (35.221.7.106)...
Retrieving speedtest.net server list...
Selecting best server based on ping...
Hosted by Pilot Fiber (Ashburn, VA) [42.41 km]: 2.762 ms
Testing download speed................................................................................
Download: 3275.77 Mbit/s
Testing upload speed......................................................................................................
Upload: 2456.03 Mbit/s

runc

Retrieving speedtest.net configuration...
Testing from Google Cloud (35.221.7.106)...
Retrieving speedtest.net server list...
Selecting best server based on ping...
Hosted by Cox - Nova (Fairfax, VA) [23.95 km]: 2.467 ms
Skipping download test
Testing upload speed......................................................................................................
Upload: 808.07 Mbit/s

Steps to reproduce

I unfortunately don't have much of a chance of getting a devbox with an H100 on it. But we're just doing this:

curl -Lo speedtest-cli https://raw.githubusercontent.com/sivel/speedtest-cli/master/speedtest.py
chmod +x speedtest-cli
./speedtest-cli

We see similar upload performance problems when uploading to Cloudflare R2.

runsc version

version release-20230717.0-12-g0244c8c19fb7
spec: 1.1.0-rc.1

docker version (if using docker)

No response

uname

Linux gcp-h100-us-east4-a-0-c965c22f-6d1f-416d-b245-395141187d95 5.15.0-205.149.5.1.el9uek.x86_64 #2 SMP Fri Apr 5 11:29:36 PDT 2024 x86_64 x86_64 x86_64 GNU/Linux

kubectl (if using Kubernetes)

No response

repo state (if built from source)

No response

runsc debug logs (if available)

output.pcap.zip (gvisor)
output_runc.pcap.zip

thundergolfer added the type: bug label on May 1, 2024
ayushr2 added the area: networking label on May 1, 2024
manninglucas self-assigned this on May 1, 2024
manninglucas commented:

Thanks for the report. Would it be possible to get equivalent pcaps for runc/runsc on the A100 where you don't see the issue?

manninglucas commented:

Also just to be sure could you confirm if the A100/H100 are running in the same region?


kevinGC commented May 2, 2024

Looking at the pcaps for both runsc and runc, it looks like every packet is repeated. Even the initial SYN shows up twice, and this isn't normal "TCP retrying after a timeout" behavior -- there's only a 7µs gap between the copies. Any idea why? It really messes with Wireshark.

[Screenshot: Wireshark capture showing each packet duplicated]

thundergolfer (author) commented:

Yeah, this is weird. I just ran sudo tcpdump -i any -w output.pcap host $CONTAINER_IP on the host.

I'll capture from an A100 and check if the same weirdness is present.


kevinGC commented May 2, 2024

Oh, I see the issue: -i any is capturing the packet on two interfaces. You can see it alternate between two "Interface index" values in Wireshark. I'm assuming the packet is captured once on the host NIC and once on the virtual ethernet device that runsc is using.

You can target a specific interface instead. Alternatively, the capture can be filtered in Wireshark via sll.ifindex == <index>, but that's messier and more cumbersome.
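For example, a single-interface capture along these lines avoids the duplication (a sketch only; the interface name and container IP below are placeholders, substitute your own):

```shell
# Capture on one named interface instead of `-i any`, so each packet is
# recorded once rather than once per interface it traverses.
# Placeholder values -- substitute your host NIC and container IP.
IFACE=eth0
CONTAINER_IP=172.16.0.2
# Printed rather than executed here, since tcpdump needs root and a live NIC:
echo "sudo tcpdump -i ${IFACE} -w output.pcap host ${CONTAINER_IP}"
```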


thundergolfer commented May 7, 2024

Ok, I have some cleaner captures.

H100 Details

  • Upload inside container: 1-5 Mbit/s
  • Instance type: a3-highgpu-8g
  • Zone: us-east4-a
  • Instance ID (GCP): 7905391184063231222

A100 details

  • Upload inside container: 220 Mbit/s - 650 Mbit/s
  • Instance Type: a2-megagpu-16g
  • Zone: us-central1-c
  • Instance ID (GCP): 7944836249609026522

I'm a Wireshark novice, but the H100 dump (left) is full of duplicate ACKs and packet retransmissions, whereas the A100 dump (right) is clean.

[Image: side-by-side Wireshark comparison -- H100 capture (left) full of duplicate ACKs and retransmissions, A100 capture (right) clean]

manninglucas commented:

Thanks for the extra logs; we're still investigating on our end. Could you send the output of ip link show on the H100 and the A100? Believe it or not, we have trouble getting access to these machine types even for our own testing.

copybara-service bot pushed a commit that referenced this issue May 8, 2024
Throughput could be lowered by entering fast recovery unnecessarily. When a
larger-than-MTU segment was retransmitted as multiple segments, loss detection
could fire either because we hit the dupack threshold or RACK detected loss due
to ACKs from retransmissions.

RACK was more susceptible to this because it's better at detecting loss and can
do so even without 3 dupacks. Thus it fell into this trap more often.

Addresses #10344.

PiperOrigin-RevId: 631967721
thundergolfer (author) commented:

Believe it or not we have trouble getting access to these types of machines even for our own testing.

😅 jeez, it's rough out there.

H100

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1460 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 42:01:ac:1e:50:2f brd ff:ff:ff:ff:ff:ff
    altname enp0s12
3: modalsvc0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 82:5f:c2:c0:ba:fb brd ff:ff:ff:ff:ff:ff
4: modal59: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 02:c9:6a:77:d2:5a brd ff:ff:ff:ff:ff:ff
6: modal57: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 46:a1:b0:76:32:e2 brd ff:ff:ff:ff:ff:ff
8: modal9: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 7a:57:61:ca:cd:71 brd ff:ff:ff:ff:ff:ff
10: modal45: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 56:57:5a:b7:a2:bf brd ff:ff:ff:ff:ff:ff
12: modal47: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether b6:16:6e:a0:79:87 brd ff:ff:ff:ff:ff:ff
14: modal16: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether e6:34:96:7c:b1:da brd ff:ff:ff:ff:ff:ff
17: modal34: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 06:3d:b7:c2:ee:40 brd ff:ff:ff:ff:ff:ff
20: modal40: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 9a:ec:c6:00:70:72 brd ff:ff:ff:ff:ff:ff
22: modal1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 82:62:e6:1c:ac:c4 brd ff:ff:ff:ff:ff:ff
24: modal35: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 46:95:a9:b1:c2:9c brd ff:ff:ff:ff:ff:ff
3097: veth07e2841d@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1460 qdisc noqueue master modal11 state UP mode DEFAULT group default
    link/ether 86:41:b8:7d:97:43 brd ff:ff:ff:ff:ff:ff link-netns wFJGhMiC7nz
26: modal4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1460 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether a2:c0:b5:38:3a:39 brd ff:ff:ff:ff:ff:ff
31: modal52: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 9a:ba:2f:48:ee:87 brd ff:ff:ff:ff:ff:ff
34: modal20: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether d2:63:ee:c5:1a:bc brd ff:ff:ff:ff:ff:ff
291: modal49: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 66:d3:7d:55:a2:ac brd ff:ff:ff:ff:ff:ff
36: modal54: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 32:ad:6a:af:9b:04 brd ff:ff:ff:ff:ff:ff
38: modal53: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 7a:b5:86:bf:59:31 brd ff:ff:ff:ff:ff:ff
40: modal27: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether e6:c2:34:dc:3c:e3 brd ff:ff:ff:ff:ff:ff
42: modal55: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether b2:eb:9e:ff:02:85 brd ff:ff:ff:ff:ff:ff
44: modal60: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 92:05:ba:e9:53:03 brd ff:ff:ff:ff:ff:ff
301: modal36: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether f6:6f:d3:9e:21:b8 brd ff:ff:ff:ff:ff:ff
46: modal28: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 4a:77:35:9c:fc:3f brd ff:ff:ff:ff:ff:ff
49: modal51: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 5e:b0:16:43:cb:45 brd ff:ff:ff:ff:ff:ff
52: modal23: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 6a:31:ff:23:b4:ca brd ff:ff:ff:ff:ff:ff
54: modal46: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 86:3d:45:1d:48:99 brd ff:ff:ff:ff:ff:ff
56: modal41: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 6e:47:98:07:b8:99 brd ff:ff:ff:ff:ff:ff
58: modal25: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether ce:e2:e5:b3:b2:29 brd ff:ff:ff:ff:ff:ff
60: modal15: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 2a:76:1a:e1:6e:d2 brd ff:ff:ff:ff:ff:ff
64: modal30: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 16:f9:a3:52:03:fa brd ff:ff:ff:ff:ff:ff
66: modal48: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether ca:0d:c9:67:dd:67 brd ff:ff:ff:ff:ff:ff
324: modal32: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 0a:64:ed:d2:5c:24 brd ff:ff:ff:ff:ff:ff
68: modal62: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 7e:92:07:97:36:6c brd ff:ff:ff:ff:ff:ff
72: modal2: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether da:3b:f8:97:bf:09 brd ff:ff:ff:ff:ff:ff
76: modal24: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 26:cb:20:2b:83:77 brd ff:ff:ff:ff:ff:ff
81: modal38: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 4e:c5:e7:7e:00:dc brd ff:ff:ff:ff:ff:ff
83: modal50: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 1a:ca:b6:59:2a:15 brd ff:ff:ff:ff:ff:ff
86: modal6: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 22:dc:66:3c:47:3c brd ff:ff:ff:ff:ff:ff
89: modal17: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether ee:7d:22:37:39:cd brd ff:ff:ff:ff:ff:ff
92: modal14: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether ba:d7:59:92:89:89 brd ff:ff:ff:ff:ff:ff
94: modal63: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 2a:8f:23:5a:97:fd brd ff:ff:ff:ff:ff:ff
100: modal3: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 02:37:3f:8c:32:d4 brd ff:ff:ff:ff:ff:ff
102: modal43: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 06:48:c5:9e:89:0f brd ff:ff:ff:ff:ff:ff
104: modal0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether ba:ab:65:70:03:a8 brd ff:ff:ff:ff:ff:ff
106: modal31: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 0a:94:08:e1:fe:6c brd ff:ff:ff:ff:ff:ff
111: modal61: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 9e:2a:d8:6d:27:c7 brd ff:ff:ff:ff:ff:ff
113: modal12: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether d6:4e:c3:ff:31:e0 brd ff:ff:ff:ff:ff:ff
115: modal19: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether ea:35:f9:9e:44:44 brd ff:ff:ff:ff:ff:ff
117: modal21: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 42:13:9d:f8:c0:68 brd ff:ff:ff:ff:ff:ff
120: modal58: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 16:f8:85:69:87:72 brd ff:ff:ff:ff:ff:ff
5243: veth31f0a7f6@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1460 qdisc noqueue master modal4 state UP mode DEFAULT group default
    link/ether 9e:52:78:fd:ba:69 brd ff:ff:ff:ff:ff:ff link-netns unwYFZtG8ao
135: modal26: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether f6:5c:f8:e9:d7:21 brd ff:ff:ff:ff:ff:ff
138: modal33: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 32:7a:39:40:2c:c7 brd ff:ff:ff:ff:ff:ff
140: modal22: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether de:6c:6c:64:7e:9b brd ff:ff:ff:ff:ff:ff
150: modal44: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 02:76:ba:b5:d9:b6 brd ff:ff:ff:ff:ff:ff
156: modal37: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 1a:ba:63:12:fd:c9 brd ff:ff:ff:ff:ff:ff
164: modal29: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 6a:2c:3c:c3:9f:8a brd ff:ff:ff:ff:ff:ff
168: modal42: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether ce:ff:6d:19:57:86 brd ff:ff:ff:ff:ff:ff
171: modal10: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether e2:6f:75:52:d0:4c brd ff:ff:ff:ff:ff:ff
176: modal39: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 06:9f:0d:e6:83:02 brd ff:ff:ff:ff:ff:ff
180: modal18: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 5e:5e:9e:ad:88:7b brd ff:ff:ff:ff:ff:ff
186: modal7: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 26:25:25:9c:15:a5 brd ff:ff:ff:ff:ff:ff
194: modal56: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether e2:2b:1b:3a:1a:e1 brd ff:ff:ff:ff:ff:ff
201: modal11: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1460 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether a2:8c:b9:9b:ff:62 brd ff:ff:ff:ff:ff:ff
214: modal5: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 82:41:43:13:b2:e7 brd ff:ff:ff:ff:ff:ff
223: modal8: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 72:c0:7b:4f:1f:b5 brd ff:ff:ff:ff:ff:ff
239: modal13: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 32:a5:8d:fa:2e:2d brd ff:ff:ff:ff:ff:ff

A100

[modal@gcp-a100-80gb-spot-europe-west4-a-0-70db3533-efb2-4ff1-86e2-ed9 ~]$ ip link show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1460 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 42:01:ac:1e:70:09 brd ff:ff:ff:ff:ff:ff
    altname enp0s9
    altname ens9
3: modalsvc0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 66:65:81:26:53:f5 brd ff:ff:ff:ff:ff:ff
4: modal35: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether aa:1d:c0:a3:66:9f brd ff:ff:ff:ff:ff:ff
6: modal22: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether b6:02:95:74:73:a4 brd ff:ff:ff:ff:ff:ff
8: modal2: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 26:f4:6d:e1:80:df brd ff:ff:ff:ff:ff:ff
10: modal62: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether e6:51:9f:ec:ef:78 brd ff:ff:ff:ff:ff:ff
12: modal60: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 5a:97:2c:03:c4:fc brd ff:ff:ff:ff:ff:ff
14: modal57: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 06:fe:bb:61:f8:f1 brd ff:ff:ff:ff:ff:ff
16: modal17: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1460 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether f2:80:dc:e6:43:29 brd ff:ff:ff:ff:ff:ff
17: vetha0b77138@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1460 qdisc noqueue master modal17 state UP mode DEFAULT group default
    link/ether e6:30:bf:c5:4d:10 brd ff:ff:ff:ff:ff:ff link-netns xuZY6HwlXOW
18: modal29: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 26:53:e6:c8:97:3e brd ff:ff:ff:ff:ff:ff
20: modal41: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether fa:db:f4:cc:41:44 brd ff:ff:ff:ff:ff:ff
22: modal23: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether da:da:fc:8b:2e:b7 brd ff:ff:ff:ff:ff:ff
24: modal40: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether ce:57:b5:d9:49:f1 brd ff:ff:ff:ff:ff:ff
26: modal53: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether ca:ea:c1:e0:0f:9e brd ff:ff:ff:ff:ff:ff
28: modal0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 2e:51:ad:58:b0:16 brd ff:ff:ff:ff:ff:ff
30: modal14: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether fa:4d:10:f3:bc:e9 brd ff:ff:ff:ff:ff:ff
34: modal42: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 82:9a:5c:fc:39:c3 brd ff:ff:ff:ff:ff:ff
36: modal10: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 12:3a:c8:b3:88:5d brd ff:ff:ff:ff:ff:ff
38: modal15: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 8e:38:02:02:1e:fa brd ff:ff:ff:ff:ff:ff
40: modal43: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 66:82:8a:4b:c9:4d brd ff:ff:ff:ff:ff:ff

copybara-service bot pushed a commit that referenced this issue May 9, 2024
We've supported PMTUD for a long time and just never turned it on.

Addresses #10344.

PiperOrigin-RevId: 632186215

kevinGC commented May 9, 2024

I have a theory about what's going on, although there are a couple of open questions regarding the packet captures. Looking at the A100 (fast) capture, we see a healthy connection. Packets are sent that appear larger than the MTU, but that's because we're using GSO to defer segmenting until the last minute.

The H100 (slow) logs seem to retransmit all larger-than-MTU packets, indicating that we're not doing GSO correctly. That makes sense: we're supplying 1500 as the MTU. This is what 94c1024 addresses: we should be using the MTU of the device interface, not the container's.

Our loss detection (TCP RACK) appears to resend non-GSO'd segments, so when we send a too-large packet it can be seen getting retransmitted in smaller chunks. You can see an example of this right at the start of stream 20 (filter tcp.stream eq 20), where we send 4140 and 2760 B packets that take ~0.8s to get retransmitted in smaller segments. I'm not sure why we send the smaller RACK segments -- could be an implementation detail, could be part of the RFC.

The two genuinely confusing things are: (1) why this differs between the two machines -- they both run 1500-byte-MTU containers with a 1460-byte-MTU NIC. I'll try to figure it out, but for now ¯\_(ツ)_/¯. And (2), why don't we see an ICMP fragmentation-needed packet in the logs?

Anyway, I think 94c1024 will fix this. @manninglucas: should we put similar logic into the default runsc boot process, using the MTU of the default device iff there's an obvious default? #10419 should also help cover more cases where the PMTU causes problems.
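The size mismatch described above can be sketched with quick arithmetic (the 1380-byte per-segment payload below is an inference from the capture, not a confirmed value; the real MSS depends on negotiated TCP options):

```shell
container_mtu=1500   # MTU configured inside the container netns
nic_mtu=1460         # MTU of the GCP NIC (eth0)
# A frame sized for the container's path overshoots the device MTU by:
overhang=$((container_mtu - nic_mtu))
echo "overhang: ${overhang} bytes"
# The 4140 B and 2760 B packets in the capture are exact multiples of a
# 1380 B payload (3x and 2x), consistent with GSO batching several
# MSS-sized segments into a single oversized packet.
echo "4140 B = $((4140 / 1380)) segments, 2760 B = $((2760 / 1380)) segments"
```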


kevinGC commented May 10, 2024

@thundergolfer In addition to testing at head with 94c1024, do you know whether you do anything to change the MTU inside runsc or the network namespace in which it runs?


kevinGC commented May 10, 2024

Also noted: H100 instances appear to always use gVNIC, while A100 can use gVNIC or virtio (defaulting to the latter, I believe). That may be part of the issue, but I'm not seeing problems when trying to repro on a gVNIC machine.
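A quick way to confirm which driver a given machine got (a sketch; eth0 is an assumption, and the command is printed rather than run since it needs a live NIC):

```shell
# gVNIC reports as "gve" and virtio as "virtio_net" in the driver field.
NIC=eth0
echo "ethtool -i ${NIC}    # check the 'driver:' line for gve vs virtio_net"
```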

thundergolfer (author) commented:

Thanks for the detailed comments @kevinGC! I think they mostly make sense to me, but I'll work through the details more carefully tomorrow while also testing out 94c1024.

To answer your follow-up question, in our CNI bridge plugin configuration we set the MTU to 1460 on GCP workers because our VPC has that set as the MTU. We first observed networking problems when running A3 instances on GCP. We'd had no trouble before. This is what we observed:

Containers on the GCP H100s have a hard time talking to the internet. They seem to manage to set up TCP connections, three way handshake succeeds, but then it seems like packets from the remote endpoint get lost and the flow stalls. For https the TLS handshake fails. - Dano from Modal

Noted also that H100 instances appear to always use gVNIC, while A100 can use gVNIC or virtio, defaulting to the latter I believe. Maybe part of the issue, but I'm not seeing issues when trying to repro on a gVNIC machine.

This very well could be why we didn't have an issue until using A3 instances.

thundergolfer (author) commented:

Testing result:

94c1024

[modal@gcp-h100-us-east4-a-0-120e6a37-350e-4d92-8b7c-507f678ee562 ~]$ ./runsc --version
runsc version release-20240506.0-13-g94c10243701c
spec: 1.1.0-rc.1
[modal@gcp-h100-us-east4-a-0-120e6a37-350e-4d92-8b7c-507f678ee562 ~]$ sudo ./runsc do ./speedtest-cli --secure
Retrieving speedtest.net configuration...
Testing from Google Cloud (34.48.63.7)...
Retrieving speedtest.net server list...
Selecting best server based on ping...
Hosted by PhoenixNAP Global IT Services (Ashburn, VA) [42.41 km]: 4.03 ms
Testing download speed................................................................................
Download: 2425.78 Mbit/s
Testing upload speed......................................................................................................
Upload: 4.54 Mbit/s
[modal@gcp-h100-us-east4-a-0-120e6a37-350e-4d92-8b7c-507f678ee562 ~]$ sudo ./runsc do ip link show
2: ve-runsc-443872: <UP,LOWER_UP> mtu 1460
    link/ether 16:3d:af:7b:ad:1b brd ff:ff:ff:ff:ff:ff
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65522
    link/loopback 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff

Host view of the network device (on a different but equivalent do run):

ip link show | grep -A2 runsc
1381: vp-runsc-630021@if1382: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 2e:58:0e:b3:cf:f8 brd ff:ff:ff:ff:ff:ff link-netns runsc-630021

Status quo

[modal@gcp-h100-us-east4-a-0-120e6a37-350e-4d92-8b7c-507f678ee562 ~]$ ./production/runsc --version
runsc version 6e61813c1b37
spec: 1.1.0-rc.1
[modal@gcp-h100-us-east4-a-0-120e6a37-350e-4d92-8b7c-507f678ee562 ~]$ sudo ./production/runsc do ./speedtest-cli --secure
Retrieving speedtest.net configuration...
Testing from Google Cloud (34.48.63.7)...
Retrieving speedtest.net server list...
Selecting best server based on ping...
Hosted by PhoenixNAP Global IT Services (Ashburn, VA) [42.41 km]: 2.583 ms
Testing download speed................................................................................
Download: 0.00 Mbit/s
Testing upload speed......................................................................................................
Upload: 34.14 Mbit/s
[modal@gcp-h100-us-east4-a-0-120e6a37-350e-4d92-8b7c-507f678ee562 ~]$
[modal@gcp-h100-us-east4-a-0-120e6a37-350e-4d92-8b7c-507f678ee562 ~]$ sudo ./production/runsc do ip link show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65522
    link/loopback 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff
2: ve-runsc-154717: <UP,LOWER_UP> mtu 1500
    link/ether 2a:5c:07:9e:7d:e7 brd ff:ff:ff:ff:ff:ff

Host view of network device (on a different but equivalent do run):

[modal@gcp-h100-us-east4-a-0-120e6a37-350e-4d92-8b7c-507f678ee562 ~]$ ip link show | grep -A2 runsc
1383: vp-runsc-012936@if1384: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether ea:b9:d9:fe:06:a7 brd ff:ff:ff:ff:ff:ff link-netns runsc-012936

Unexpected results. With the status-quo runsc version, do gets zero download throughput but some upload throughput. With the 94c1024 runsc version, download is high but upload is lower than the status quo?

It also seems odd that the host still shows an MTU of 1500 even though, with 94c1024, the container now picks up the MTU of eth0.


kevinGC commented May 15, 2024

As a workaround, you can pass runsc a --gso=false flag that should get you close to native speeds. @manninglucas got us a test machine and -- with our particular setup -- upload throughput goes from ~1.65 Mbit/s to 745 Mbit/s!

We have some ideas regarding the root cause -- the H100's NIC driver may be in some way different -- that we'll keep looking into for now.

copybara-service bot pushed a commit that referenced this issue May 15, 2024
We've supported PMTUD for a long time and just never turned it on.

Addresses #10344.

PiperOrigin-RevId: 634003508
copybara-service bot pushed a commit that referenced this issue May 15, 2024
Throughput could be lowered by entering fast recovery unnecessarily. When a
larger-than-MTU segment was retransmitted as multiple segments, loss detection
could fire either because we hit the dupack threshold or RACK detected loss due
to ACKs from retransmissions.

RACK was more susceptible to this because it's better at detecting loss and can
do so even without 3 dupacks. Thus it fell into this trap more often.

Addresses #10344.

PiperOrigin-RevId: 631967721
@thundergolfer

--gso=false does indeed improve upload!

[modal@gcp-h100-us-east4-a-0-a275c742-c07d-433e-bcc0-46bf967048d7 ~]$ sudo ./production/runsc -gso=false do ./speedtest-cli --secure
Retrieving speedtest.net configuration...
Testing from Google Cloud (34.86.32.183)...
Retrieving speedtest.net server list...
Selecting best server based on ping...
Hosted by Whitesky Communications LLC (Ashburn, VA) [42.41 km]: 2.482 ms
Testing download speed................................................................................
Download: 0.00 Mbit/s
Testing upload speed......................................................................................................
Upload: 326.61 Mbit/s

There's still the same 0.00 Mbit/s download with `runsc do`, but we can put that aside.

We have used --gso=false in the past and it degraded performance (#9816 (comment)), but we can selectively enable it for H100s, where it's clearly the better option 👍

copybara-service bot pushed a commit that referenced this issue May 15, 2024
Throughput could be lowered by entering fast recovery unnecessarily. When a
larger-than-MTU segment was retransmitted as multiple segments, loss detection
could fire either because we hit the dupack threshold or RACK detected loss due
to ACKs from retransmissions.

RACK was more susceptible to this because it's better at detecting loss and can
do so even without 3 dupacks. Thus it fell into this trap more often.

Addresses #10344.

PiperOrigin-RevId: 634071568
@kevinGC commented May 15, 2024

Glad that flag works. Not sure where that awful download stat comes from; I don't see it when I try to replicate at any commit. Will keep looking, especially if you're still seeing it after these patches.

@avagin commented May 16, 2024

I spent a few hours trying to figure out what could be wrong with our GSO packets. At some point, I started thinking that we were looking for a black cat in a dark room, so I decided to test this theory by running a Kata container and checking whether the issue is reproducible in that environment. A Kata container is a virtual machine with a virtio network device; it injects GSO packets from the guest into the host Linux kernel in a similar way to gVisor, but uses a different kernel API to do so. Inside a Kata VM, the Linux kernel is running, so it is completely unrelated to the gVisor netstack. It was not a surprise when I found that the same issue is triggered in Kata containers:

# uname -a
Linux 41339094ec18 6.1.62 #1 SMP Wed May 15 05:03:25 UTC 2024 x86_64 Linux
/ # lspci
00:00.0 Host bridge: Intel Corporation 82G33/G31/P35/P31 Express DRAM Controller
00:01.0 Communication controller: Red Hat, Inc. Virtio console
00:02.0 PCI bridge: Red Hat, Inc. QEMU PCI-PCI bridge
00:03.0 SCSI storage controller: Red Hat, Inc. Virtio SCSI
00:04.0 Unclassified device [00ff]: Red Hat, Inc. Virtio RNG
00:05.0 Communication controller: Red Hat, Inc. Virtio 1.0 socket (rev 01)
00:06.0 Mass storage controller: Red Hat, Inc. Virtio file system (rev 01)
00:1f.0 ISA bridge: Intel Corporation 82801IB (ICH9) LPC Interface Controller (rev 02)
00:1f.2 SATA controller: Intel Corporation 82801IR/IO/IH (ICH9R/DO/DH) 6 port SATA Controller [AHCI mode] (rev 02)
00:1f.3 SMBus: Intel Corporation 82801I (ICH9 Family) SMBus Controller (rev 02)
01:01.0 Ethernet controller: Red Hat, Inc. Virtio network device
/ # python3 /tmp/speedtest-cli 
Retrieving speedtest.net configuration...
Testing from Google Cloud...
Retrieving speedtest.net server list...
Selecting best server based on ping...
Hosted by StarHub Ltd (Singapore) [5.78 km]: 2.561 ms
Testing download speed................................................................................
Download: 3019.78 Mbit/s
Testing upload speed......................................................................................................
Upload: 2.26 Mbit/s

In summary, I'm inclined to believe that this issue isn't tied to gVisor. More likely, it resides either within the Linux kernel itself or within the gvnic device or its driver.

@manninglucas commented May 21, 2024

To those interested, @avagin found the cause of this bug. It's a small issue with the GVE network driver that's used on some GCP hardware. The driver code can be found here.

This code in the driver drops a packet if its GSO type isn't exactly equal to SKB_GSO_TCPV4 or SKB_GSO_TCPV6. For gVisor, Kata, or any process that injects packets with virtio-net headers already set, the kernel marks those packets with an additional flag, SKB_GSO_DODGY. The packets then fail the exact-match comparison against SKB_GSO_TCPV4 because of this extra flag and get dropped. These drops only happen on H100s because those machines use a different kind of NIC that requires packets to be written in the DQO format rather than the default; the default path has no equivalent exact-match check.

We will try to expedite a fix in the GVE driver as best we can from our end. Filing a formal support ticket with GCP may help move the process along as well.

Closing this issue now as it is not a bug with gVisor.

@thundergolfer

Nice one!
