simult_flows: unbalanced bwidth tests are unstable #475

Open
matttbe opened this issue Feb 15, 2024 · 3 comments

matttbe commented Feb 15, 2024

Mainly according to NIPA (history: normal - debug), this test is unstable. The instability can be seen both with and without the unbalanced delay, but mainly with it.

In NIPA, a lot of tests are executed in parallel. In theory, this should be fine: most selftests don't use all the available resources. Still, even if the issue also happens in non-debug mode, this parallelism could explain some of the instability.

The patch in our tree, 1011924 ("selftests: mptcp: simult flows: re-adapt BW"), could improve the situation, but its effect has not been properly measured.

It would be good to:

  • check if the estimations done in the scripts are OK (see the sketch after this list)
  • check some counters that could explain extra delays -- losses (especially at the beginning/end), retransmissions, etc. -- and add some slack
  • take into account the extra delay needed to establish the connections? (but the issue can be seen without extra delays)
  • maybe a pending fix could improve the situation?
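
For reference, here is a minimal, hedged sketch of how such a maximum transfer time can be estimated from the amount of data and the link bandwidth, and where extra slack could be added. This is not the actual simult_flows.sh code; the variable names and values are only illustrative.

```bash
#!/bin/bash
# Illustrative estimation only, not the selftest code: compute the ideal
# transfer time from size and bandwidth, then add a slack percentage.

size_kb=16384        # hypothetical amount of data to transfer (KiB)
bw_mbps=10           # hypothetical aggregated bandwidth (Mbps)
slack_pc=5           # hypothetical extra tolerance, in percent

# ideal time in ms: (bytes * 8) bits / (Mbps * 1000 bits-per-ms)
expected_ms=$((size_kb * 1024 * 8 / (bw_mbps * 1000)))

# slack for connection setup, losses, retransmissions, scheduling noise, ...
max_ms=$((expected_ms + expected_ms * slack_pc / 100))

echo "expected ${expected_ms} ms, allowed max ${max_ms} ms"
```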

matttbe commented Mar 6, 2024

I did some tests on my side. I had to use the non-debug mode (otherwise the tolerance is already increased), and no KVM.

It looks like the patch from our tree ("selftests: mptcp: simult flows: re-adapt BW") helps. When running the test 100 times (looped roughly as in the sketch after this list):

  • without the patch, I can reproduce the issue (without KVM) around 10% of the time.
  • with the patch, I only hit the issue twice.
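
For anyone wanting to reproduce these numbers, the repetition can be done with a simple shell loop like the hedged sketch below; the script path and the assumption that a failing run exits with a non-zero status are mine, not taken from the selftest documentation.

```bash
#!/bin/bash
# Hedged sketch: repeat the selftest N times and count the failures.
fail=0
runs=100
for i in $(seq 1 "${runs}"); do
    ./simult_flows.sh > "run-${i}.log" 2>&1 || fail=$((fail + 1))
done
echo "failures: ${fail}/${runs}"
```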

I tried to gather more info when the issue happens (without the patch). When there is a failure:

  • It looks like the traffic dropped a bit in the middle:
    [graph: per-packet traffic over time during the failed transfer, showing the drop]
  • During this transfer, MPTcpExtDuplicateData is at 8 on the client side, and MPTcpExtMPTCPRetrans is at 8 on the server side.
  • (that was for the test using port 10002)
unbalanced bwidth - reverse direction                       transfer slower than expected! runtime 12033 ms, expected 11921 ms max 11921       [ fail ]

If it is expected to have these counters > 0, maybe we can increase the max value? One technical issue in this test is that the timeout is given to mptcp_connect (the binary).
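
To illustrate this constraint, here is a schematic sketch (not the selftest code: the slack value and the use of timeout(1) around a stand-in command are assumptions): the allowed duration has to be fixed before the client starts, so it cannot be relaxed afterwards based on the counters observed during the transfer.

```bash
#!/bin/bash
# Schematic only: the allowed duration is chosen up front and handed to the
# client process as its timeout; 'sleep' stands in for the real client.
max_ms=11921                                  # value from the failing run above
slack_pc=5                                    # hypothetical extra slack
max_ms=$((max_ms + max_ms * slack_pc / 100))
timeout_s=$(( (max_ms + 999) / 1000 ))        # round up to whole seconds
if timeout "${timeout_s}s" sleep 5; then
    echo "transfer finished within ${timeout_s}s"
else
    echo "transfer slower than expected (killed after ${timeout_s}s)"
fi
```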

If not, here are the packet captures:

Here are the MIB counters:

  • Server side:

| Counter | OK | Err | Diff |
| --- | --- | --- | --- |
| IpInReceives | 1585 | 867 | -718 |
| IpInDelivers | 1585 | 867 | -718 |
| IpOutRequests | 1654 | 933 | -721 |
| IpOutTransmits | 1654 | 933 | -721 |
| TcpActiveOpens | 2 | 2 | 0 |
| TcpInSegs | 1585 | 867 | -718 |
| TcpOutSegs | 12151 | 12027 | -124 |
| Ip6OutRequests | 1 | 1 | 0 |
| Ip6OutMcastPkts | 1 | 1 | 0 |
| Ip6OutOctets | 56 | 56 | 0 |
| Ip6OutMcastOctets | 56 | 56 | 0 |
| Ip6OutTransmits | 1 | 1 | 0 |
| Icmp6OutMsgs | 1 | 1 | 0 |
| Icmp6OutRouterSolicits | 1 | 1 | 0 |
| Icmp6OutType133 | 1 | 1 | 0 |
| TcpExtTW | 2 | 2 | 0 |
| TcpExtDelayedACKLost | 1 | 0 | -1 |
| TcpExtTCPPureAcks | 1556 | 838 | -718 |
| TcpExtTCPLossProbes | 2 | 2 | 0 |
| TcpExtTCPDSACKOldSent | 1 | 0 | -1 |
| TcpExtTCPSpuriousRtxHostQueues | 1 | 1 | 0 |
| TcpExtTCPOrigDataSent | 12128 | 12004 | -124 |
| TcpExtTCPHystartTrainDetect | 1 | 0 | -1 |
| TcpExtTCPHystartTrainCwnd | 171 | 0 | -171 |
| TcpExtTCPHystartDelayDetect | 1 | 2 | 1 |
| TcpExtTCPHystartDelayCwnd | 222 | 724 | 502 |
| TcpExtTCPDelivered | 12130 | 12006 | -124 |
| IpExtInOctets | 198548 | 137920 | -60628 |
| IpExtOutOctets | 16909204 | 16947976 | 38772 |
| IpExtInNoECTPkts | 1634 | 906 | -728 |
| MPTcpExtMPCapableSYNTX | 1 | 1 | 0 |
| MPTcpExtMPCapableSYNACKRX | 1 | 1 | 0 |
| MPTcpExtMPTCPRetrans | 0 | 8 | 8 |
| MPTcpExtMPJoinSynAckRx | 1 | 1 | 0 |
| MPTcpExtOFOQueueTail | 9 | 8 | -1 |
| MPTcpExtOFOQueue | 11 | 9 | -2 |
| MPTcpExtOFOMerge | 8 | 7 | -1 |
| MPTcpExtDuplicateData | 1 | 0 | -1 |
| MPTcpExtSndWndShared | 1 | 0 | -1 |
| MPTcpExtRcvWndShared | 11 | 7 | -4 |
| Time (ms) (limit: 11921) | 11344 | 12033 | 689 |
  • Client side:

| Counter | Success | Error | Diff |
| --- | --- | --- | --- |
| IpInReceives | 1654 | 933 | -721 |
| IpInDelivers | 1654 | 933 | -721 |
| IpOutRequests | 1585 | 867 | -718 |
| IpOutTransmits | 1585 | 867 | -718 |
| TcpPassiveOpens | 2 | 2 | 0 |
| TcpInSegs | 1654 | 933 | -721 |
| TcpOutSegs | 1633 | 906 | -727 |
| TcpRetransSegs | 1 | 0 | -1 |
| Ip6OutRequests | 1 | 0 | -1 |
| Ip6OutMcastPkts | 1 | 0 | -1 |
| Ip6OutOctets | 56 | 0 | -56 |
| Ip6OutMcastOctets | 56 | 0 | -56 |
| Ip6OutTransmits | 1 | 0 | -1 |
| Icmp6OutMsgs | 1 | 0 | -1 |
| Icmp6OutRouterSolicits | 1 | 0 | -1 |
| Icmp6OutType133 | 1 | 0 | -1 |
| TcpExtPAWSEstab | 1 | 0 | -1 |
| TcpExtTW | 0 | 1 | 1 |
| TcpExtDelayedACKs | 7 | 6 | -1 |
| TcpExtTCPPureAcks | 20 | 21 | 1 |
| TcpExtTCPLossProbes | 3 | 4 | 1 |
| TcpExtTCPDSACKRecv | 1 | 0 | -1 |
| TcpExtTCPOrigDataSent | 75 | 66 | -9 |
| TcpExtTCPDelivered | 76 | 66 | -10 |
| TcpExtTCPDSACKRecvSegs | 1 | 0 | -1 |
| IpExtInOctets | 16909204 | 16947976 | 38772 |
| IpExtOutOctets | 198548 | 137920 | -60628 |
| IpExtInNoECTPkts | 12151 | 12027 | -124 |
| MPTcpExtMPCapableSYNRX | 1 | 1 | 0 |
| MPTcpExtMPCapableACKRX | 1 | 1 | 0 |
| MPTcpExtMPTCPRetrans | 1 | 0 | -1 |
| MPTcpExtMPJoinSynRx | 1 | 1 | 0 |
| MPTcpExtMPJoinAckRx | 1 | 1 | 0 |
| MPTcpExtOFOQueueTail | 849 | 492 | -357 |
| MPTcpExtOFOQueue | 868 | 504 | -364 |
| MPTcpExtOFOMerge | 503 | 405 | -98 |
| MPTcpExtDuplicateData | | 8 | 8 |
| MPTcpExtRcvWndShared | 568 | 362 | -206 |
| MPTcpExtRcvWndConflictUpdate | 1 | | -1 |
| Time (ms) (limit: 11921) | 11344 | 12033 | 689 |

(the TcpIn/OutSegs counters are interesting: the client and the server do not report the same numbers. Is that an issue, or an effect of GRO?)
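
For reference, counters like the ones above can be collected per network namespace with nstat, roughly as in this hedged sketch ("ns1" is a placeholder namespace name; this is not the selftest helper code):

```bash
#!/bin/bash
# Hedged sketch: snapshot the MIB counters around a transfer in one netns.
ip netns exec ns1 nstat -n                    # reset nstat's history
# ... run the transfer here ...
ip netns exec ns1 nstat | grep -E 'Tcp(In|Out|Retrans)Segs|MPTcpExt'
```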

Maybe a first step is to send the patch we have in our tree, since it already seems to help? Or do we want a better fix with the new BW limits? cc: @pabeni


matttbe commented Mar 14, 2024

Note: here is the same graph as last time, but for the "success" run (cc: @mjmartineau):

[graph: per-packet traffic over time for the successful run]

And all the captures I had: simult_flows_analyses.zip (cc: @pabeni )

  • "fail": only the 2nd test failed
  • "success": all tests were successful

mjmartineau commented

Thanks for the success chart. In the failure graph, it's interesting that the segment lengths settle into three main groups around the 6-second mark, until the throughput drops around 6.75 s:

  • ~34000
  • ~30000
  • <1000

Not sure what it means yet. The success graph maintains a more randomized pattern, without the "ceiling" around 34000 for the segment length.
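
For reference, the per-segment sizes over time behind such graphs can be extracted from a capture with tshark, e.g. as in this sketch (the capture file name is a placeholder):

```bash
# Hedged sketch: dump the relative timestamp and TCP payload length of each
# packet, ready to be plotted; "client.pcap" is a placeholder file name.
tshark -r client.pcap -Y tcp -T fields \
    -e frame.time_relative -e tcp.len > segments.tsv
```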

MPTCPimporter pushed a commit that referenced this issue May 20, 2024
These tests have been flaky since their introduction. This might be less
visible, or not visible at all, depending on the CI running the tests,
especially if it is also busy doing other tasks in parallel.

A first analysis showed that the transfer can be slowed down when there
are some re-injections at the MPTCP level. Such re-injections can of
course happen, and disturb the transfer, but it looks strange to have
them in this lab. That could be caused by the kernel having access to
fewer CPU cycles -- e.g. when other activities are executed in parallel
-- or by a misinterpretation on the MPTCP packet scheduler side.

While this is being investigated, the tests are marked as flaky so as
not to create noise in other CIs.

Link: #475
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Message-Id: <20240520-selftests-mptcp-disable-flaky-v1-2-02e23ba0bc3b@kernel.org>
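
For context, here is a hedged sketch of what "marking a test as flaky" can look like in a shell selftest; the helper name and the environment variable are illustrative, not the actual mptcp_lib.sh API.

```bash
#!/bin/bash
# Illustrative only: a failing flaky subtest is reported but does not fail
# the whole selftest, unless strict mode is requested via an env var.
run_flaky_subtest() {
    local name="${1}"
    shift
    if "${@}"; then
        echo "ok: ${name}"
    elif [ "${SELFTEST_FLAKY_IS_ERROR:-0}" = "1" ]; then
        echo "fail: ${name}"
        return 1
    else
        echo "flaky (ignored): ${name}"
    fi
}

run_flaky_subtest "unbalanced bwidth" false   # 'false' simulates a failure
```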
matttbe added a commit that referenced this issue May 24, 2024
This reverts commit 7e6c60f.

The previous values helped to find a bug, so it is better to keep them,
even if the test is slower because the transfer takes longer.

Link: #475
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>