Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Retiring the initial connection ID #77

Open
WesleyRosenblum opened this issue Mar 9, 2021 · 9 comments
Open

Retiring the initial connection ID #77

WesleyRosenblum opened this issue Mar 9, 2021 · 9 comments
Assignees
Labels

Comments

@WesleyRosenblum
Copy link

While testing interoperability of the Quant client, we've encountered an issue related to Quant's handling of NEW_CONNECTION_ID frames.

One scenario that triggers these issues involves a peer that issues a NEW_CONNECTION_ID frame that requests to retire all previously issued connection IDs. This should be allowed by the following from QUIC Transport §5.1.2:

Upon receipt of an increased Retire Prior To field, the peer MUST stop using the corresponding connection IDs and retire them with RETIRE_CONNECTION_ID frames before adding the newly provided connection ID to the set of active connection IDs. This ordering allows an endpoint to replace all active connection IDs without the possibility of a peer having no available connection IDs and without exceeding the limit the peer sets in the active_connection_id_limit transport parameter; see Section 18.2.

Let's say Quant has just completed a handshake with a peer. At this point, Quant has a single connection ID (with sequence ID 0) to communicate with the peer and the Retire Prior To value is 0.

Now, say the following NEW_CONNECTION_ID frames, which indicates the peer no longer wants to use the connection ID used during the handshake, are received by Quant:

NEW_CONNECTION_ID Frame {
     Type = 0x18,
     Sequence Number = 1,
     Retire Prior To = 1,
     Length (8) = 16,
     Connection ID = f780dae634e407fcf8ef015bfa9fc9fe,
     Stateless Reset Token = 5605420c8220e5afdf4762a52fc03519,
   }

NEW_CONNECTION_ID Frame {
     Type = 0x18,
     Sequence Number = 2,
     Retire Prior To = 1,
     Length (8) = 16,
     Connection ID = 53376708d16b58ee742860e80ea730af,
     Stateless Reset Token = d95fbead5bbd916166f22040a1620918,
   }

NEW_CONNECTION_ID Frame {
     Type = 0x18,
     Sequence Number = 3,
     Retire Prior To = 1,
     Length (8) = 16,
     Connection ID = dd69e1f1045f09c19f7ee6419d206fe9,
     Stateless Reset Token = ae66adb0abe28f7029816767fb35f794,
   }

As soon as Quant processes this frame and attempts to retire the initial connection ID, the following assertion failure occurs:

dec_frames frame.c:1393 HANDSHAKE_DONE
abandon_pn pn.c:136 abandoning clnt Handshake processing
validate_pmtu pkt.c:174 PMTU 1472 validated
dec_new_cid_frame frame.c:1105 NEW_CONNECTION_ID seq=1 rpt=1 len=16 dcid=1:f780dae634e407fcf8ef015bfa9fc9fe srt=5605420c8220e5afdf4762a52fc03519
dec_new_cid_frame frame.c:1105 NEW_CONNECTION_ID seq=2 rpt=1 len=16 dcid=2:53376708d16b58ee742860e80ea730af srt=d95fbead5bbd916166f22040a1620918
dec_new_cid_frame frame.c:1129 rpt 1 <= prev max 1, ignoring
dec_new_cid_frame frame.c:1105 NEW_CONNECTION_ID seq=3 rpt=1 len=16 dcid=3:dd69e1f1045f09c19f7ee6419d206fe9 srt=ae66adb0abe28f7029816767fb35f794
dec_new_cid_frame frame.c:1129 rpt 1 <= prev max 1, ignoring
dec_retire_cid_frame frame.c:1194 RETIRE_CONNECTION_ID seq=0
0.146   conns_by_id_del conn.c:787 conns_by_id size 3
conns_by_id_del conn.c:794 conns_by_id size 2
use_next_dcid conn.c:215 migration to dcid 1:f780dae634e407fcf8ef015bfa9fc9fe for clnt conn (was 0:f66838703f6e489aab5e1ee100212fa2)
0.147   cid_retire cid.c:178 ABORT: assertion failed: 
id->available == false && id->retired == false 
can only retire active cid [errno 2 = No such file or directory]
/lib/x86_64-linux-gnu/libasan.so.5(+0x6cd30) [0x7fd225729d30]
util_die at /src/lib/deps/warpcore/lib/src/util.c:398
cid_retire at /src/lib/src/cid.c:178
use_next_dcid at /src/lib/src/conn.c:222
do_conn_mgmt at /src/lib/src/conn.c:534
tx at /src/lib/src/conn.c:701
rx at /src/lib/src/conn.c:1658
loop_run at /src/lib/src/loop.c:103 (discriminator 3)
q_ready at /src/lib/src/quic.c:1008 (discriminator 3)
main at /src/bin/client.c:613

I believe this may be happening due to cid.available being initialized to true for the first connection ID in use (link).

I also noticed several uses of the NO_MIGRATION directives in parts of the code dealing handling NEW_CONNECTION_ID frames and sending RETIRE_CONNECTION_ID frames (for example). A peer should be able to issue NEW_CONNECTION_IDs without actually performing or allowing connection migrations. This may be done, for example, because the peer may only wish to use a given connection ID for a limited amount of time.

Thanks for considering this issue and let me know if you need any further information!

@larseggert
Copy link
Contributor

Thanks for the report! Most stacks haven't been sending rpt values other than 0 yet, only msquic recently started to do so. So that code is very lightly tested still, I'm not surprised there are issues.

I recently fixed a bug related to rpt handling (in bfb9b74) - did you test with a version that is older or newer than that?

Regarding NO_MIGRATION, I'm not sure I follow your explanation. If a peer isn't performing or allowing connection migrations, how can it "only wish to use a given connection ID for a limited amount of time"?

@larseggert
Copy link
Contributor

Also, do you have a way for me to reproduce this locally?

@WesleyRosenblum
Copy link
Author

Thanks Lars!

The server I was testing with is not yet compatible with the Draft 33 initial salt used by Quant, so I was testing against a version from prior to this commit: 6570d70

I do see the recent msquic runs against Quant showing rpt=1 and it seems to be working, so perhaps that bug fix may have already fixed it. Unfortunately I don't have a way for you to reproduce this locally or to try against more recent versions of Quant. If you have an ability to create an integration test for this that would be best to ensure it works and doesn't regress in the future.

Regarding NO_MIGRATION, I'm not sure I follow your explanation. If a peer isn't performing or allowing connection migrations, how can it "only wish to use a given connection ID for a limited amount of time"?

Using a new connection ID is a requirement of connection migration, but it's not the only reason an endpoint may wish to rotate connection IDs and there is no requirement to migrate connections when new connection IDs are issued or retired. I had a similar discussion with Picoquic on this topic: private-octopus/picoquic#1151

@larseggert
Copy link
Contributor

OK, I'll leave this open until your code is on draft -34 and you can re-test? I unfortunately don't have a test suite - quant will get archived when the RFCs are out, it served as research code and not for any sort of deployment. So I cut lots of corners.

Regarding NO_MIGRATION, you are correct. Earlier versions of the spec allowed sending active_connection_id_limit of zero, which made this optimization allowed. I should revise that. (Many of those defines were only added to produce stats for https://eggert.org/papers/2020-ndss-quic-iot.pdf.)

@larseggert
Copy link
Contributor

@WesleyRosenblum have you had a chance to test this again?

@WesleyRosenblum
Copy link
Author

Hi Lars. We haven't updated to draft 34 yet and it may still be a little while longer. If you'd prefer you can close this and I can reopen it if necessary when we get to updating and testing this out.

@WesleyRosenblum
Copy link
Author

We've updated to QUIC v1, and this does seem to still be a problem, both for Quant client and server.

Here is a recent log output:

dec_retire_cid_frame frame.c:1214 RETIRE_CONNECTION_ID seq=0
dec_frames frame.c:1349 PADDING len=9
log_pkt pkt.c:159 TX to=193.167.100.100:443 0x60=Short kyph=0 spin=1 dcid=0:3087ddf3cd51584a7e52621e05bc8d95 nr=3
enc_ack_frame frame.c:1593 ACK 0x03=ECN lg=1 delay=813 (6504 usec) cnt=0 rng=1 [0..1]
enc_ack_frame frame.c:1627 ECN ect0=2 ect1=0 ce=0
enc_new_cid_frame frame.c:1986 NEW_CONNECTION_ID seq=3 rpt=0 len=4 cid=3:a2804f11 srt=95572ac3679607992b75d7d2eeb359e2 
log_sent_pkts conn.c:271 clnt conn 1:de4552eb, Data unacked: 3
log_cc recovery.c:123 clnt conn 1:de4552eb: in_flight=67 (+67), cwnd=13591 (+0), ssthresh=0 (+0), srtt=0.048 (+0.000), rttvar=0.016 (+0.000)
0.137   log_pkt pkt.c:118 RX from=193.167.100.100:443 len=115 ecn=ECT0 0x40=Short kyph=0 spin=0 dcid=0:de4552eb nr=2
dec_frames frame.c:1416 HANDSHAKE_DONE
dec_new_cid_frame frame.c:1112 NEW_CONNECTION_ID seq=1 rpt=1 len=16 dcid=1:7c35ef83b12a37fcc115a8a02b214793 srt=5b1c4ea562785c40a92529edbd6526d5
0.137   use_next_dcid conn.c:221 ABORT: assertion failed: 
dcid 
can't switch from dcid 0 [errno 22 = Invalid argument]
/lib/x86_64-linux-gnu/libasan.so.5(+0x6cd30) [0x7f14429f9d30]
util_die at /src/lib/deps/warpcore/lib/src/util.c:399
use_next_dcid at /src/lib/src/conn.c:221
dec_new_cid_frame at /src/lib/src/frame.c:1152 (discriminator 3)
dec_frames at /src/lib/src/frame.c:1465
rx_pkt at /src/lib/src/conn.c:1203
rx_pkts at /src/lib/src/conn.c:1551
rx at /src/lib/src/conn.c:1664
loop_run at /src/lib/src/loop.c:103 (discriminator 3)
q_ready at /src/lib/src/quic.c:951
main at /src/bin/client.c:617
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3) [0x7f1441b6d0b3]
_start at ??:?

@larseggert
Copy link
Contributor

Thanks! I'll take a look. (But quant has sort of outlived its usefulness as a test implementation now that v1 is done, so this is not high priority for me at then moment.)

@WesleyRosenblum
Copy link
Author

We do have a public QNS image you can use to reproduce this now: public.ecr.aws/s2n/s2n-quic-qns:latest

@larseggert larseggert self-assigned this Feb 26, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants