[archived] Initial Design

Matthieu Baerts edited this page Feb 24, 2021 · 1 revision

Initial Design

This corresponds to the initial design, when we were looking at a limited feature set for initial upstreaming.

!! This page is therefore probably outdated. !!


Proposed Overall Guidelines

  • MPTCP is used when requested by the application, either through an IPPROTO_MPTCP parameter to socket() or by using the new ULP (Upper Layer Protocol) capability.

  • Focus on a limited feature set for initial upstreaming:

    • Server use case
    • Baseline functionality
    • Performance optimizations can be deferred
    • Path manager and scheduler customization are deferred
  • Avoid adding indirect function calls to fast paths.

  • Propose TCP option features in the context of existing code first; figure out an abstraction for TCP options later.

  • Move away from meta-sockets, treating each subflow more like a regular TCP connection. The overall MPTCP connection is coordinated by an upper layer socket that is distinct from tcp_sock.

  • Move functionality to userspace where possible, like tracking ADD_ADDRs received, initiating new subflows, or accepting new subflows.

  • Avoid adding locks to coordinate access to data that's shared between subflows. Utilize capabilities like compare-and-swap (cmpxchg), atomics, and RCU to deal with shared data efficiently.
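The lock-free update pattern this guideline has in mind can be sketched in userspace with C11 atomics. This is only an illustration: the kernel would use cmpxchg() or its atomic64 helpers, and struct conn_state, its data_ack field, and conn_advance_data_ack() are hypothetical names that do not come from the prototype. Sequence wraparound is ignored to keep the sketch short.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical connection-level state shared by all subflows. */
struct conn_state {
    _Atomic uint64_t data_ack;   /* highest data-level ACK seen so far */
};

/*
 * Advance data_ack to new_ack if new_ack is newer. Returns true if this
 * caller performed the update. No lock is taken: if another subflow wins
 * the race, the compare-and-swap fails, 'cur' is refreshed with the value
 * that subflow stored, and the loop re-checks.
 */
static bool conn_advance_data_ack(struct conn_state *cs, uint64_t new_ack)
{
    assert(cs);
    uint64_t cur = atomic_load(&cs->data_ack);

    while (new_ack > cur) {
        if (atomic_compare_exchange_weak(&cs->data_ack, &cur, new_ack))
            return true;
    }
    return false;   /* someone already recorded an equal or newer ACK */
}
```

Each subflow can call conn_advance_data_ack() from its own receive path; a stale value simply loses the race and leaves the shared state untouched.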


Metadata Queuing

Our multipath TCP architecture in the Linux kernel needs to propagate MPTCP metadata between the upper mptcp_sock layer and TCP option handling in each TCP subflow.

The requirements are:

  1. Pass received MPTCP option values (found in the TCP option headers) to the mptcp_sock. Option values from dropped packets must not be propagated.
  2. Make outgoing DSS mapping data available to the TCP option writing code. The DSS option must be sent with the earliest data byte that the mapping applies to. If that byte is retransmitted, the mapping must be retransmitted with it.
  3. Be able to send other MPTCP sub-options (ADD_ADDR, REMOVE_ADDR, etc.) on outgoing ACK or data packets.

Incoming option values

Our first prototype bypasses skb coalesce and collapse for incoming TCP subflows, so there is one skb per TCP packet in each subflow receive queue. This allows access to the TCP headers in each packet by accessing the transport headers, but the approach seems fragile and it would be better to keep normal coalesce/collapse behavior.

Proposal: Use the socket error queue for incoming MPTCP option values. Each incoming packet with MPTCP options is either cloned, or its header bytes are copied into a new skb. The resulting skb, containing only header payload, is placed on the TCP subflow error queue, with a sock_exterr_skb struct in the control block indicating an MPTCP-specific error origin or type (SO_EE_ORIGIN_MPTCP?). The mptcp_sock could then read and process the error queue, and use the mapping data to determine which subflow to read data from next. This is similar to how SO_TIMESTAMPING uses the error queue for transmitted packets.

Alternative: A separate queue could be added to each subflow_sock for the incoming MPTCP option values. This could avoid extra skb overhead for each incoming option, but would not leverage the existing queue or code supporting it.

Outgoing DSS mappings

DSS mappings are closely tied to packet data, so a natural place to add the DSS mapping data is to the sk_buff structure. In our prototype, we created an optional "shared control buffer" at the end of skb_shared_info. Eric Dumazet provided feedback that it could be ok to add to struct sk_buff if the new data was not initialized.

Proposal: Add an unsigned char *private pointer to the end of struct sk_buff, which is not initialized by alloc_skb(). Also add a flag to the existing bitfields to indicate that the private pointer is in use (__u8 private_in_use:1), along with some helper functions. An skb owner would then be able to allocate data, then assign the private pointer and set the flag. skb_release_head_state() would free the allocated memory when private_in_use is set. MPTCP would make use of this extra space to store DSS-related data (currently around 25 bytes). Some changes to do_tcp_sendpages() are still required to allow the skb to be updated with the DSS values before it is pushed.
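A userspace mock of the flag-guarded pointer may make the proposal easier to follow. struct sketch_skb stands in for struct sk_buff, and the three helper names are hypothetical; the proposal only says that "some helper functions" would exist. The key point is that the pointer is never read unless the flag says it is valid, which is what allows the allocator to leave it uninitialized.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Stand-in for struct sk_buff; only the two proposed fields are shown. */
struct sketch_skb {
    unsigned int private_in_use:1;  /* zeroed with the other bitfields */
    unsigned char *private;         /* NOT initialized by the allocator */
};

/* Owner allocates 'len' bytes, then assigns the pointer and the flag. */
static int sketch_skb_set_private(struct sketch_skb *skb, size_t len)
{
    unsigned char *p = calloc(1, len);

    if (!p)
        return -1;
    skb->private = p;
    skb->private_in_use = 1;
    return 0;
}

/* The pointer is only trusted when the flag is set. */
static unsigned char *sketch_skb_private(const struct sketch_skb *skb)
{
    return skb->private_in_use ? skb->private : NULL;
}

/* What skb_release_head_state() would do under the proposal. */
static void sketch_skb_release_private(struct sketch_skb *skb)
{
    if (skb->private_in_use) {
        free(skb->private);
        skb->private_in_use = 0;
    }
}
```

MPTCP would call the setter with room for its DSS data (around 25 bytes in the prototype) before the skb is pushed.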

Alternative: Add a new queue (or rb tree?) for DSS records. When sending or resending data packets, if the earliest byte covered by a mapping is part of the payload, populate the DSS option from the stored records. Clean up the relevant DSS record when the data it covers is ACKed.
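The record-based alternative could look roughly like the following. It is a sketch under several assumptions: a small fixed array replaces the queue or rb tree, the field names in struct dss_record are invented for illustration, and both function names are hypothetical. The lookup only matches when the first mapped byte lies inside the segment being sent, so a retransmission of the middle of a mapping carries no DSS option.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stored mapping; field names are illustrative only. */
struct dss_record {
    uint64_t data_seq;     /* data sequence number of first mapped byte */
    uint32_t subflow_seq;  /* subflow sequence number of that byte */
    uint16_t data_len;     /* bytes covered by the mapping */
    bool in_use;
};

/*
 * Find the mapping whose first byte falls inside the segment
 * [seq, seq + len). Unsigned subtraction keeps the range test correct
 * when subflow sequence numbers wrap.
 */
static struct dss_record *dss_find_start(struct dss_record *recs, size_t n,
                                         uint32_t seq, uint32_t len)
{
    assert(recs);
    for (size_t i = 0; i < n; i++) {
        if (recs[i].in_use && (uint32_t)(recs[i].subflow_seq - seq) < len)
            return &recs[i];
    }
    return NULL;
}

/* Release every mapping fully covered by 'ack'; returns how many. */
static size_t dss_release_acked(struct dss_record *recs, size_t n,
                                uint32_t ack)
{
    size_t freed = 0;

    for (size_t i = 0; i < n; i++) {
        uint32_t end = recs[i].subflow_seq + recs[i].data_len;

        if (recs[i].in_use && (int32_t)(ack - end) >= 0) {
            recs[i].in_use = false;
            freed++;
        }
    }
    return freed;
}
```

The send and retransmit paths would call dss_find_start() per segment, and the ACK path would call dss_release_acked().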

Outgoing non-DSS options

Other MPTCP options like ADD_ADDR or REMOVE_ADDR need to be sent, either as part of an ACK packet or a data packet with sufficient option space. Ideally the functionality is also generic enough to use when completing the MPTCP connection handshake.

Proposal: Add a function to directly send an ACK with a specified MPTCP option payload.

Alternative: Add a lockless linked list to struct subflow_sock. Options would be added without acquiring a lock, and each packet send (ACK or data) would check for options awaiting transmission. The mptcp_sock would request an ACK send when it enqueues an option request.
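The lockless list in this alternative can be sketched with C11 atomics, in the spirit of the kernel's llist (producers push with compare-and-swap, the consumer detaches the whole list with one exchange). All names here are hypothetical, and the subtype field is only a placeholder for whatever option data would really be queued.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical pending-option entry. */
struct opt_node {
    struct opt_node *next;
    unsigned char subtype;   /* placeholder: which MPTCP sub-option */
};

struct opt_list {
    _Atomic(struct opt_node *) first;
};

/* Producer side: push without taking a lock. */
static void opt_list_add(struct opt_list *l, struct opt_node *n)
{
    struct opt_node *head = atomic_load(&l->first);

    do {
        n->next = head;   /* 'head' is refreshed when the CAS fails */
    } while (!atomic_compare_exchange_weak(&l->first, &head, n));
}

/*
 * Consumer side (packet send path): detach every pending entry at once.
 * Entries come back newest first.
 */
static struct opt_node *opt_list_take_all(struct opt_list *l)
{
    return atomic_exchange(&l->first, (struct opt_node *)NULL);
}
```

Because entries are returned newest first, a sender that cares about ordering would need to reverse the detached list before writing options.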


API Design

Socket creation with socket()

By default, TCP sockets will continue to use regular TCP. MPTCP applications will opt in to MPTCP using the socket API.

Our current prototype implements IPPROTO_MPTCP, but this could be switched to AF_MULTIPATH without disrupting a lot of code.

IPPROTO_MPTCP proposal: Use the protocol parameter in the socket() system call to request MPTCP sockets.

Example: sock = socket(AF_INET, SOCK_STREAM, IPPROTO_MPTCP);

  • MPTCP is a layer above IP and can be grouped with TCP/UDP/UDP-Lite/SCTP/etc. under AF_INET and AF_INET6 families. MPTCP is intended to have a TCP-like interface.
  • Various combinations of AF_INET, AF_INET6, and the IPV6_V6ONLY socket option provide a way to control v4/v6 selection for the initial subflow and whether later subflows may be mixed v4/v6.
  • Compatible with the existing Python socket library. It is not clear whether any other languages or frameworks have a similar benefit.
  • Could add a BPF_CGROUP hook to rewrite the socket protocol in specific cgroups, which could rewrite the protocol number to make TCP applications in that group use MPTCP by default.
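For the IPPROTO_MPTCP variant, an application-side opt-in might look like the sketch below. The fallback to plain TCP is an illustrative choice made by this example, not part of the proposal, and open_stream_socket() is a hypothetical helper. The fallback definition of IPPROTO_MPTCP uses 262, the value in upstream Linux headers, for C libraries that do not define it yet.

```c
#include <assert.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef IPPROTO_MPTCP
#define IPPROTO_MPTCP 262   /* upstream Linux value; missing in older libc */
#endif

/*
 * Open a stream socket, preferring MPTCP. If the kernel rejects the
 * protocol (for example, no MPTCP support), retry with regular TCP.
 * On return, *is_mptcp is 1 if MPTCP was obtained and 0 otherwise.
 * Returns the file descriptor, or -1 if both attempts fail.
 */
static int open_stream_socket(int family, int *is_mptcp)
{
    assert(is_mptcp);
    int fd = socket(family, SOCK_STREAM, IPPROTO_MPTCP);

    if (fd >= 0) {
        *is_mptcp = 1;
        return fd;
    }
    *is_mptcp = 0;
    return socket(family, SOCK_STREAM, IPPROTO_TCP);
}
```

Passing AF_INET or AF_INET6 as the family selects the address family of the initial subflow, as described in the bullets above.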

AF_MULTIPATH proposal: Use the domain parameter in the socket() system call to request MPTCP sockets.

Example: sock = socket(AF_MULTIPATH, SOCK_STREAM, 0);

  • Similar to AF_KCM, the multipath socket could be thought of as a separate socket type that makes use of TCP sockets, but not itself a direct layer on top of IP.
  • MPTCP might not need everything inet_create() does.
  • A BPF_CGROUP hook could rewrite the domain to make TCP applications in a group default to MPTCP.

Other Features

Break-before-make: The initial upstream version may close the MPTCP connection when all subflows go away. It would be acceptable to implement break-before-make later if that simplifies the original code.

TCP Fast Open: The initial implementation may reject TFO and fall back to the 3-way handshake.

Duplicate DSS in consecutive packets: Having identical TCP options in consecutive packets allows for packet aggregation by the receiver or middleboxes.