simult_flows: unbalanced bwidth tests are unstable #475

Open
matttbe opened this issue Feb 15, 2024 · 3 comments

matttbe commented Feb 15, 2024

Mainly according to NIPA (history: normal - debug), this test is unstable. The instability can be seen both with and without the unbalanced delay, but mainly with it.

In NIPA, a lot of tests are executed in parallel. In theory, this should be fine: most selftests don't use all the available resources. Still, even if the issue also happens in non-debug mode, this parallelism could explain some of the instability.

The patch in our tree, 1011924 ("selftests: mptcp: simult flows: re-adapt BW"), could improve the situation, but its effect has not been properly measured.

It would be good to:

  • check if the estimations done in the scripts are OK (see the sketch after this list)
  • check some counters that could explain extra delays -- losses (especially at the beginning/end), retransmissions, etc. -- and add some slack
  • take into account the extra delay needed to establish the connections? (but the issue can be seen without extra delays)
  • maybe a pending fix could improve the situation?
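
For reference, here is a minimal, hedged sketch of how such a maximum transfer time can be estimated from the amount of data and the link bandwidth, and where extra slack could be added. This is not the actual simult_flows.sh code; the variable names and values are only illustrative.

```bash
#!/bin/bash
# Illustrative estimation only, not the selftest code: compute the ideal
# transfer time from size and bandwidth, then add a slack percentage.

size_kb=16384        # hypothetical amount of data to transfer (KiB)
bw_mbps=10           # hypothetical aggregated bandwidth (Mbps)
slack_pc=5           # hypothetical extra tolerance, in percent

# ideal time in ms: (bytes * 8) bits / (Mbps * 1000 bits-per-ms)
expected_ms=$((size_kb * 1024 * 8 / (bw_mbps * 1000)))

# slack for connection setup, losses, retransmissions, scheduling noise, ...
max_ms=$((expected_ms + expected_ms * slack_pc / 100))

echo "expected ${expected_ms} ms, allowed max ${max_ms} ms"
```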

matttbe commented Mar 6, 2024

I did some tests on my side. I had to use the non-debug mode (otherwise the tolerance is already increased), and no KVM.

It looks like the patch from our tree ("selftests: mptcp: simult flows: re-adapt BW") helps. When running the test 100 times (looped roughly as in the sketch after this list):

  • without the patch, I can reproduce the issue (without KVM) around 10% of the time.
  • with the patch, I only hit the issue twice.
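
For anyone wanting to reproduce these numbers, the repetition can be done with a simple shell loop like the hedged sketch below; the script path and the assumption that a failing run exits with a non-zero status are mine, not taken from the selftest documentation.

```bash
#!/bin/bash
# Hedged sketch: repeat the selftest N times and count the failures.
fail=0
runs=100
for i in $(seq 1 "${runs}"); do
    ./simult_flows.sh > "run-${i}.log" 2>&1 || fail=$((fail + 1))
done
echo "failures: ${fail}/${runs}"
```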

I tried to gather more info when the issue happens (without the patch). When there is a failure:

  • It looks like the traffic dropped a bit in the middle:
    [graph: per-packet traffic over time during the failed transfer, showing the drop]
  • During this transfer, MPTcpExtDuplicateData is at 8 on the client side, and MPTcpExtMPTCPRetrans is at 8 on the server side.
  • (that was for the test using port 10002)
unbalanced bwidth - reverse direction                       transfer slower than expected! runtime 12033 ms, expected 11921 ms max 11921       [ fail ]

If it is expected to have these counters > 0, maybe we can increase the max value? One technical issue in this test is that the timeout is given to mptcp_connect (the binary).
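
To illustrate this constraint, here is a schematic sketch (not the selftest code: the slack value and the use of timeout(1) around a stand-in command are assumptions): the allowed duration has to be fixed before the client starts, so it cannot be relaxed afterwards based on the counters observed during the transfer.

```bash
#!/bin/bash
# Schematic only: the allowed duration is chosen up front and handed to the
# client process as its timeout; 'sleep' stands in for the real client.
max_ms=11921                                  # value from the failing run above
slack_pc=5                                    # hypothetical extra slack
max_ms=$((max_ms + max_ms * slack_pc / 100))
timeout_s=$(( (max_ms + 999) / 1000 ))        # round up to whole seconds
if timeout "${timeout_s}s" sleep 5; then
    echo "transfer finished within ${timeout_s}s"
else
    echo "transfer slower than expected (killed after ${timeout_s}s)"
fi
```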

If not, here are the packet captures:

Here are the MIB counters:

  • Server side:

| Counter | OK | Err | Diff |
| --- | --- | --- | --- |
| IpInReceives | 1585 | 867 | -718 |
| IpInDelivers | 1585 | 867 | -718 |
| IpOutRequests | 1654 | 933 | -721 |
| IpOutTransmits | 1654 | 933 | -721 |
| TcpActiveOpens | 2 | 2 | 0 |
| TcpInSegs | 1585 | 867 | -718 |
| TcpOutSegs | 12151 | 12027 | -124 |
| Ip6OutRequests | 1 | 1 | 0 |
| Ip6OutMcastPkts | 1 | 1 | 0 |
| Ip6OutOctets | 56 | 56 | 0 |
| Ip6OutMcastOctets | 56 | 56 | 0 |
| Ip6OutTransmits | 1 | 1 | 0 |
| Icmp6OutMsgs | 1 | 1 | 0 |
| Icmp6OutRouterSolicits | 1 | 1 | 0 |
| Icmp6OutType133 | 1 | 1 | 0 |
| TcpExtTW | 2 | 2 | 0 |
| TcpExtDelayedACKLost | 1 | 0 | -1 |
| TcpExtTCPPureAcks | 1556 | 838 | -718 |
| TcpExtTCPLossProbes | 2 | 2 | 0 |
| TcpExtTCPDSACKOldSent | 1 | 0 | -1 |
| TcpExtTCPSpuriousRtxHostQueues | 1 | 1 | 0 |
| TcpExtTCPOrigDataSent | 12128 | 12004 | -124 |
| TcpExtTCPHystartTrainDetect | 1 | 0 | -1 |
| TcpExtTCPHystartTrainCwnd | 171 | 0 | -171 |
| TcpExtTCPHystartDelayDetect | 1 | 2 | 1 |
| TcpExtTCPHystartDelayCwnd | 222 | 724 | 502 |
| TcpExtTCPDelivered | 12130 | 12006 | -124 |
| IpExtInOctets | 198548 | 137920 | -60628 |
| IpExtOutOctets | 16909204 | 16947976 | 38772 |
| IpExtInNoECTPkts | 1634 | 906 | -728 |
| MPTcpExtMPCapableSYNTX | 1 | 1 | 0 |
| MPTcpExtMPCapableSYNACKRX | 1 | 1 | 0 |
| MPTcpExtMPTCPRetrans | 0 | 8 | 8 |
| MPTcpExtMPJoinSynAckRx | 1 | 1 | 0 |
| MPTcpExtOFOQueueTail | 9 | 8 | -1 |
| MPTcpExtOFOQueue | 11 | 9 | -2 |
| MPTcpExtOFOMerge | 8 | 7 | -1 |
| MPTcpExtDuplicateData | 1 | 0 | -1 |
| MPTcpExtSndWndShared | 1 | 0 | -1 |
| MPTcpExtRcvWndShared | 11 | 7 | -4 |
| Time (ms) (limit: 11921) | 11344 | 12033 | 689 |
  • Client side:

| Counter | Success | Error | Diff |
| --- | --- | --- | --- |
| IpInReceives | 1654 | 933 | -721 |
| IpInDelivers | 1654 | 933 | -721 |
| IpOutRequests | 1585 | 867 | -718 |
| IpOutTransmits | 1585 | 867 | -718 |
| TcpPassiveOpens | 2 | 2 | 0 |
| TcpInSegs | 1654 | 933 | -721 |
| TcpOutSegs | 1633 | 906 | -727 |
| TcpRetransSegs | 1 | 0 | -1 |
| Ip6OutRequests | 1 | 0 | -1 |
| Ip6OutMcastPkts | 1 | 0 | -1 |
| Ip6OutOctets | 56 | 0 | -56 |
| Ip6OutMcastOctets | 56 | 0 | -56 |
| Ip6OutTransmits | 1 | 0 | -1 |
| Icmp6OutMsgs | 1 | 0 | -1 |
| Icmp6OutRouterSolicits | 1 | 0 | -1 |
| Icmp6OutType133 | 1 | 0 | -1 |
| TcpExtPAWSEstab | 1 | 0 | -1 |
| TcpExtTW | 0 | 1 | 1 |
| TcpExtDelayedACKs | 7 | 6 | -1 |
| TcpExtTCPPureAcks | 20 | 21 | 1 |
| TcpExtTCPLossProbes | 3 | 4 | 1 |
| TcpExtTCPDSACKRecv | 1 | 0 | -1 |
| TcpExtTCPOrigDataSent | 75 | 66 | -9 |
| TcpExtTCPDelivered | 76 | 66 | -10 |
| TcpExtTCPDSACKRecvSegs | 1 | 0 | -1 |
| IpExtInOctets | 16909204 | 16947976 | 38772 |
| IpExtOutOctets | 198548 | 137920 | -60628 |
| IpExtInNoECTPkts | 12151 | 12027 | -124 |
| MPTcpExtMPCapableSYNRX | 1 | 1 | 0 |
| MPTcpExtMPCapableACKRX | 1 | 1 | 0 |
| MPTcpExtMPTCPRetrans | 1 | 0 | -1 |
| MPTcpExtMPJoinSynRx | 1 | 1 | 0 |
| MPTcpExtMPJoinAckRx | 1 | 1 | 0 |
| MPTcpExtOFOQueueTail | 849 | 492 | -357 |
| MPTcpExtOFOQueue | 868 | 504 | -364 |
| MPTcpExtOFOMerge | 503 | 405 | -98 |
| MPTcpExtDuplicateData | | 8 | 8 |
| MPTcpExtRcvWndShared | 568 | 362 | -206 |
| MPTcpExtRcvWndConflictUpdate | 1 | | -1 |
| Time (ms) (limit: 11921) | 11344 | 12033 | 689 |

(the TcpIn/OutSegs counters are interesting: the client and the server do not report the same numbers. Is that an issue, or an effect of GRO?)
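
For reference, counters like the ones above can be collected per network namespace with nstat, roughly as in this hedged sketch ("ns1" is a placeholder namespace name; this is not the selftest helper code):

```bash
#!/bin/bash
# Hedged sketch: snapshot the MIB counters around a transfer in one netns.
ip netns exec ns1 nstat -n                    # reset nstat's history
# ... run the transfer here ...
ip netns exec ns1 nstat | grep -E 'Tcp(In|Out|Retrans)Segs|MPTcpExt'
```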

Maybe a first step is to send the patch we have in our tree, since it already seems to help? Or do we want a better fix with the new BW limits? cc: @pabeni


matttbe commented Mar 14, 2024

Note: here is the same graph as last time, but for the "success" run (cc: @mjmartineau):

[graph: per-packet traffic over time for the successful run]

And all the captures I had: simult_flows_analyses.zip (cc: @pabeni )

  • "fail": only the 2nd test failed
  • "success": all tests were successful

mjmartineau commented

Thanks for the success chart. In the failure graph, it's interesting that the segment lengths settle into three main groups around the 6-second mark, until the throughput drops around 6.75 s:

  • ~34000
  • ~30000
  • <1000

Not sure what it means yet. The success graph maintains a more randomized pattern, without the "ceiling" around 34000 for the segment length.
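
For reference, the per-segment sizes over time behind such graphs can be extracted from a capture with tshark, e.g. as in this sketch (the capture file name is a placeholder):

```bash
# Hedged sketch: dump the relative timestamp and TCP payload length of each
# packet, ready to be plotted; "client.pcap" is a placeholder file name.
tshark -r client.pcap -Y tcp -T fields \
    -e frame.time_relative -e tcp.len > segments.tsv
```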

MPTCPimporter pushed a commit that referenced this issue May 20, 2024
These tests have been flaky since their introduction. This might be less
visible, or not visible at all, depending on the CI running the tests,
especially if it is also busy doing other tasks in parallel.

A first analysis showed that the transfer can be slowed down when there
are some re-injections at the MPTCP level. Such re-injections can of
course happen, and disturb the transfer, but it looks strange to have
them in this lab. That could be caused by the kernel having access to
fewer CPU cycles -- e.g. when other activities are executed in parallel
-- or by a misinterpretation on the MPTCP packet scheduler side.

While this is being investigated, the tests are marked as flaky so as
not to create noise in other CIs.

Link: #475
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Message-Id: <20240520-selftests-mptcp-disable-flaky-v1-2-02e23ba0bc3b@kernel.org>
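
For context, here is a hedged sketch of what "marking a test as flaky" can look like in a shell selftest; the helper name and the environment variable are illustrative, not the actual mptcp_lib.sh API.

```bash
#!/bin/bash
# Illustrative only: a failing flaky subtest is reported but does not fail
# the whole selftest, unless strict mode is requested via an env var.
run_flaky_subtest() {
    local name="${1}"
    shift
    if "${@}"; then
        echo "ok: ${name}"
    elif [ "${SELFTEST_FLAKY_IS_ERROR:-0}" = "1" ]; then
        echo "fail: ${name}"
        return 1
    else
        echo "flaky (ignored): ${name}"
    fi
}

run_flaky_subtest "unbalanced bwidth" false   # 'false' simulates a failure
```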
matttbe added a commit that referenced this issue May 24, 2024
This reverts commit 7e6c60f.

The previous values helped to find a bug, so it is better to keep them,
even if the test is slower because the transfer takes longer.

Link: #475
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>