tcp_benchmark --client throughput craters after a few seconds #10312

Open
kevinGC opened this issue Apr 22, 2024 · 1 comment
Labels
type: bug Something isn't working

Comments

kevinGC (Collaborator) commented Apr 22, 2024

Description

//test/benchmarks/tcp:tcp_benchmark can run with netstack as the iperf client, server, or neither (native). As the server (and with host GRO/GSO enabled) I see throughput similar to native. As the client, I regularly see the same pattern: a few seconds of throughput at parity with Linux followed by a complete cratering of throughput:

[Graph: shrinkingWindow (throughput over time, showing the sudden drop after a few seconds)]

Looking at the logs, this looks to be triggered by netstack's inability to handle shrinking receive windows. In the attached pcap, shrinkingWindowMini.pcap.zip (shrunk down to only the relevant packets to keep the file size manageable), you can see two things. First, there are several instances of "normal" full receive buffers / zero windows. These are the reason for the graph's flat shape: transfer is limited by rwnd, not cwnd.

Second, at the end of the capture is the sequence of packets that corresponds to the massive throughput drop in the graph. There are two notable bits here:

  • There's an RTO-sized gap between the zero window ACK and the next packet (which is our zero window probe)
  • The receive window shrinks. This can't be seen in the sliced-up pcap (because it lacks the handshake with the window size), but it is visible with the full log:
[Screenshot: packet log from the full capture around the window shrink]

Note that the [TCP Window Full] packet has sequence number 319710246 and length 1920, indicating that it exactly fills the receive window. But the [TCP ZeroWindow] packet acknowledges only up to that same sequence number while advertising a zero window, meaning the 1920 bytes just sent now fall outside the window. Thus netstack considers this an RTO and drops the cwnd all the way to 1 segment, causing the slowdown.
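To make the window arithmetic concrete, here is a tiny illustrative calculation (the sequence number and length come from the capture above; the SND.* terms follow RFC 9293, and the snippet is only a sketch, not netstack code):

```go
// Illustrative only: recompute the sender's usable window at the moment of
// the shrink, using the numbers from the capture above.
package main

import "fmt"

func main() {
	const (
		sndUna = 319710246 // ACK carried by the [TCP ZeroWindow] packet
		segLen = 1920      // length of the already-sent [TCP Window Full] segment
		sndWnd = 0         // window advertised by that zero-window ACK
	)
	sndNxt := sndUna + segLen
	// Usable window (RFC 9293 Section 3.8.6.2.1): SND.UNA + SND.WND - SND.NXT.
	usable := sndUna + sndWnd - sndNxt
	fmt.Println("usable window:", usable) // -1920: the sent bytes are beyond the right window edge
}
```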

But per RFC 9293 3.8.6, netstack shouldn't consider those bytes relevant to an RTO:

   A TCP receiver SHOULD NOT shrink the window, i.e., move the right
   window edge to the left (SHLD-14).  However, a sending TCP peer MUST
   be robust against window shrinking, which may cause the "usable
   window" (see Section 3.8.6.2.1) to become negative (MUST-34).

   If this happens, the sender SHOULD NOT send new data (SHLD-15), but
   SHOULD retransmit normally the old unacknowledged data between
   SND.UNA and SND.UNA+SND.WND (SHLD-16).  The sender MAY also
   retransmit old data beyond SND.UNA+SND.WND (MAY-7), but SHOULD NOT
   time out the connection if data beyond the right window edge is not
   acknowledged (SHLD-17).  If the window shrinks to zero, the TCP
   implementation MUST probe it in the standard way (described below)
   (MUST-35).

In other words, we should be treating this case like a regular zero window. In terms of a fix, we could have RTO handling only adjust cwnd when the sent bytes are in-window.
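A minimal sketch of that idea, assuming a simplified sender (the field and helper names below are invented for illustration and this is not netstack's actual sender code):

```go
// Illustrative only: a stripped-down sender, not netstack's real one.
package tcp

type sender struct {
	sndUna uint32 // oldest unacknowledged sequence number
	sndNxt uint32 // next sequence number to send
	sndWnd uint32 // most recent window advertised by the peer
	cwnd   int    // congestion window, in segments
}

// onResendTimer runs when the retransmit/probe timer fires.
func (s *sender) onResendTimer() {
	// If the peer shrank its window so that the outstanding bytes now sit
	// beyond SND.UNA+SND.WND, RFC 9293 (SHLD-17) says not to time out on
	// them: probe/retransmit, but leave cwnd alone.
	if s.sndNxt-s.sndUna > s.sndWnd {
		s.sendZeroWindowProbe()
		return
	}
	// Genuine RTO: collapse cwnd to one segment and retransmit.
	s.cwnd = 1
	s.retransmitOldestSegment()
}

func (s *sender) sendZeroWindowProbe()     { /* elided */ }
func (s *sender) retransmitOldestSegment() { /* elided */ }
```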

Steps to reproduce

This has some excessive sudos left over from when I was testing XDP:

$ bazel build //test/benchmarks/tcp:all && sudo cp bazel-bin/test/benchmarks/tcp/tcp_proxy_/tcp_proxy bazel-bin/test/benchmarks/tcp/tcp_proxy && sudo bazel-bin/test/benchmarks/tcp/tcp_benchmark --duration 20 --ideal --gso 65536 --no-user-ns --client

runsc version

N/A: This is netstack-specific

docker version (if using docker)

N/A: This is netstack-specific

repo state (if built from source)

release-20240415.0-18-g4810afc36

runsc debug logs (if available)

No response

kevinGC added the type: bug label on Apr 22, 2024
hbhasker (Contributor) commented:

So I think the issue arises from the fact that we piggyback on the RTO timer.

As per the RFC, probing after 200ms is correct behavior:

https://datatracker.ietf.org/doc/html/rfc9293#section-3.8.6.1

But because we piggyback on the RTO timer

s.resendTimer.enable(s.RTO)
we end up taking an RTO and reducing cwnd.

The right solution might be to split the timers and disable the RTO timer when probing a zero window. I would check the Linux implementation.
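For illustration, one possible shape of that split (all names below are placeholders, the timer stub's API is not netstack's real timer, and this is independent of the earlier sketch; Linux, for comparison, drives zero-window probing from a dedicated probe/persist timer separate from its retransmit timer):

```go
// Placeholder sketch, not netstack code: timer and field names are invented.
package tcp

import "time"

// timerStub is a minimal stand-in for netstack's timer wrapper.
type timerStub struct{ t *time.Timer }

func (ts *timerStub) enable(d time.Duration, f func()) { ts.t = time.AfterFunc(d, f) }
func (ts *timerStub) disable() {
	if ts.t != nil {
		ts.t.Stop()
	}
}

type sender struct {
	resendTimer  timerStub     // existing RTO timer
	zeroWndTimer timerStub     // new, dedicated zero-window probe timer
	probeBackoff time.Duration // backs off between probes, independent of RTO state
}

// enterZeroWindowProbe runs when the peer's advertised window drops to zero.
// Because the RTO timer is disabled for the duration of the stall, the probe
// can never be misread as a retransmission timeout, so cwnd stays intact.
func (s *sender) enterZeroWindowProbe() {
	s.resendTimer.disable()
	s.probeBackoff = 200 * time.Millisecond
	s.zeroWndTimer.enable(s.probeBackoff, s.sendZeroWindowProbe)
}

func (s *sender) sendZeroWindowProbe() {
	// Send one byte at SND.UNA (elided) and re-arm with a larger backoff,
	// without ever touching cwnd or the RTO state.
	s.probeBackoff *= 2
	s.zeroWndTimer.enable(s.probeBackoff, s.sendZeroWindowProbe)
}
```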
