Resolve server/client stuck at the test end before results-exchange #1527

Open

davidBar-On wants to merge 2 commits into master

Conversation

@davidBar-On (Contributor) commented Jun 3, 2023

A suggested fix for the issue where the server and client get stuck at the end of the test because the client does not receive the EXCHANGE_RESULTS state, based on @ffolkes1911's test results and the discussion starting at this comment.

The issue occurred on a cellular network, so the fix seems to be important for any iperf3 use over cellular networks. As the changes are for the end of the test, they are probably also relevant to the multi-thread version.

The root cause of the issue was that the server (in reverse mode) sent only about 60KB of each 128KB TCP packet. Therefore, the last read by the client did not receive a full 128KB packet. Since the read was in blocking mode, the client got stuck and was unable to respond to the server, so the server also got stuck waiting for the client's reply. It is not clear whether the EXCHANGE_RESULTS message was lost or just delayed, but the fix handles both cases.
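
For context, the kind of blocking read that gets stuck here follows the usual read-exactly-N-bytes pattern (similar in spirit to iperf3's Nread() helper; read_exactly() below is only an illustration, not the PR's code): if the peer delivers only about 60KB of an expected 128KB and then goes quiet, the loop never completes and the caller never returns.

    #include <errno.h>
    #include <stddef.h>
    #include <unistd.h>

    /* Read exactly count bytes from a blocking socket (illustrative helper).
     * If the peer sends fewer bytes and then stays silent, read() blocks and
     * this function never returns -- which is how the client got stuck. */
    static ssize_t
    read_exactly(int fd, char *buf, size_t count)
    {
        size_t nleft = count;

        while (nleft > 0) {
            ssize_t r = read(fd, buf, nleft);   /* blocks until data or EOF */
            if (r < 0) {
                if (errno == EINTR)
                    continue;
                return -1;
            }
            if (r == 0)
                break;                          /* connection closed */
            nleft -= (size_t) r;
            buf += r;
        }
        return (ssize_t)(count - nleft);
    }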

The main changes done:

  1. The client does not set the TCP streams to blocking mode at TEST_END, so that it can still receive non-full late packets (see the sketch after this list).
  2. Allow --rcv-timeout in the client in sending mode; it is used at the end of the test as a timeout for reading the exchange-results messages.
  3. The server does not close the test streams at TEST_END. This is because select() returns immediately when monitoring closed sockets, so the client's select() timeout would not be effective in that case.
  4. The client does not monitor "write sockets" at TEST_END. Otherwise, even when no input is received from the server, the client's select() for late packets would return immediately (as the write sockets are available), which would defeat the --rcv-timeout.
  5. When the client sends IPERF_DONE or CLIENT_TERMINATE, it sends 3 additional null bytes. This makes sure that if the server is waiting for the exchanged-results JSON length (4 bytes), it does not get stuck in the read (the server will then fail because it does not receive valid JSON).
  6. Cancel the client timers at TEST_END, as they are redundant at that point and may interfere with --rcv-timeout, because select() may return after a timer expires and before the receive timeout expires.
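
A minimal sketch of the idea behind items 1, 3 and 4, assuming plain POSIX sockets: the client leaves its stream sockets non-blocking, monitors only the read set in select(), and bounds the wait with the --rcv-timeout value. The drain_late_data() helper and all names below are illustrative, not the actual iperf3 code.

    /* Illustrative only -- not the actual iperf3 implementation. */
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/select.h>
    #include <sys/time.h>
    #include <unistd.h>

    /* Drain late, possibly partial data from non-blocking stream sockets,
     * waiting at most rcv_timeout_us microseconds for new data.
     * Entries in fds[] are set to -1 once the peer closes them. */
    static int
    drain_late_data(int *fds, int nfds, long rcv_timeout_us)
    {
        char buf[128 * 1024];
        int open_fds = nfds;

        /* Item 1: keep the stream sockets non-blocking at TEST_END. */
        for (int i = 0; i < nfds; i++)
            fcntl(fds[i], F_SETFL, fcntl(fds[i], F_GETFL, 0) | O_NONBLOCK);

        while (open_fds > 0) {
            fd_set read_set;    /* Item 4: no write sockets are monitored. */
            int maxfd = -1;

            FD_ZERO(&read_set);
            for (int i = 0; i < nfds; i++) {
                if (fds[i] < 0)
                    continue;
                FD_SET(fds[i], &read_set);
                if (fds[i] > maxfd)
                    maxfd = fds[i];
            }

            struct timeval tv = {
                .tv_sec  = rcv_timeout_us / 1000000,
                .tv_usec = rcv_timeout_us % 1000000,
            };

            int result = select(maxfd + 1, &read_set, NULL, NULL, &tv);
            if (result < 0 && errno != EINTR)
                return -1;      /* select() failed */
            if (result == 0)
                return 0;       /* --rcv-timeout expired: stop waiting */

            for (int i = 0; i < nfds; i++) {
                if (fds[i] < 0 || !FD_ISSET(fds[i], &read_set))
                    continue;
                ssize_t n = read(fds[i], buf, sizeof(buf));
                if (n > 0) {
                    printf("drained %zd late bytes from fd %d\n", n, fds[i]);
                } else if (n == 0) {
                    /* Item 3: the sockets themselves stay open; here we only
                     * stop monitoring a stream the peer has closed. */
                    fds[i] = -1;
                    open_fds--;
                }
                /* n < 0 with EAGAIN just means no data right now. */
            }
        }
        return 0;
    }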

@davidBar-On (Contributor Author) commented:

Re-implementation for the multi-thread iperf3. I am not sure whether the original problem still happens with multi-thread. Some of the changes may be worth implementing in any case, so if needed I can submit a PR with only these changes:

  • Removed write_set as it became redundant with multi-thread.
  • Added a timer-canceling function, instead of canceling the timers one by one in several places.
  • Cancel the client timers at TEST_END, as they are redundant at that point and may interfere with --rcv-timeout, because select() may return after a timer expires and before the receive timeout expires.
  • Added sp->socket = -1 in two places, which makes the code cleaner.
  • Added some debug messages that may be useful in general.

The PR also includes the code changes to resolve the original issue:

  • Allow --rcv-timeout in the client in sending mode; it is used at the end of the test as a timeout for reading the exchange-results messages.
  • When the client sends IPERF_DONE or CLIENT_TERMINATE, it sends 3 additional null bytes. This makes sure that if the server is waiting for the exchanged-results JSON length (4 bytes), it does not get stuck in the read (the server will then fail because it does not receive valid JSON). See the sketch below.
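
A rough sketch of the idea in the last bullet, assuming a plain POSIX control socket. The send_final_state_padded() helper and the numeric state values are placeholders for this example (the real state definitions live in iperf3's iperf_api.h); this is not the code in the PR.

    #include <sys/socket.h>
    #include <sys/types.h>

    /* Placeholder state values for the sketch; see iperf_api.h for the real ones. */
    #define IPERF_DONE_STATE        16
    #define CLIENT_TERMINATE_STATE  12

    /* Send the final 1-byte state followed by 3 null bytes.  If the server is
     * already blocked reading the 4-byte length of the results JSON, these
     * bytes complete that read, so it fails cleanly on an invalid length/JSON
     * instead of hanging forever. */
    static int
    send_final_state_padded(int ctrl_sck, char state)
    {
        char msg[4] = { state, 0, 0, 0 };   /* state byte + 3 padding null bytes */

        if (send(ctrl_sck, msg, sizeof(msg), 0) != (ssize_t) sizeof(msg))
            return -1;
        return 0;
    }

Usage would be along the lines of send_final_state_padded(ctrl_sck, IPERF_DONE_STATE) just before the client closes the control connection.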

@@ -562,7 +545,6 @@ iperf_run_server(struct iperf_test *test)
}

memcpy(&read_set, &test->read_set, sizeof(fd_set));
memcpy(&write_set, &test->write_set, sizeof(fd_set));
Contributor commented:

Removing this means that write_set may have uninitialized values on line 581. Should you just remove the variable entirely, since it is unused?

@davidBar-On (Contributor Author) replied:

Thanks for the comment. Forgot to remove all occurrences of write_set... Now removed (with rebase).

@davidBar-On force-pushed the issue-819-server-stuck-after-printing-results branch from dbe8d68 to fcf1c46 on May 10, 2024 06:50
@@ -580,8 +583,9 @@ iperf_run_client(struct iperf_test * test)

/* Begin calculating CPU utilization */
cpu_util(NULL);
rcv_timeout_value_in_us = (test->settings->rcv_timeout.secs * SEC_TO_US) + test->settings->rcv_timeout.usecs;
Contributor commented:

This change creates a scenario with an early timeout.
Given:

  • test->mode == SENDER
  • rcv_timeout_value_in_us > 0
  • test->state == TEST_END (or EXCHANGE_RESULTS or DISPLAY_RESULTS)

This implies that rcv_timeout_us = 0. Then on line 642, the if statement effectively evaluates to if (t_usecs > 0), which is almost always true since t_usecs will practically always be greater than 0.

It might make more sense to split the blocks so the correct timeout is used:
(I've also renamed rcv_timeout_value_in_us -> end_rcv_timeout and rcv_timeout_us -> running_rcv_timeout to hopefully make their use clearer)

    if (result < 0 && errno != EINTR) {
        i_errno = IESELECT;
        goto cleanup_and_fail;
    } else if (result == 0 && (running_rcv_timeout > 0 && test->state == TEST_RUNNING)) {
        /*
         * If nothing was received in non-reverse running state
         * then probably something got stuck - either client,
         * server or network, and the test should be terminated.
         */
        iperf_time_now(&now);
        if (iperf_time_diff(&now, &last_receive_time, &diff_time) == 0) {
            t_usecs = iperf_time_in_usecs(&diff_time);
            if (t_usecs > running_rcv_timeout) {
                /* Idle timeout if no new blocks received */
                if (test->blocks_received == last_receive_blocks) {
                    i_errno = IENOMSG;
                    goto cleanup_and_fail;
                }
            }
        }
    } else if (result == 0 && (end_rcv_timeout > 0 && (test->state == TEST_END
                                                       || test->state == EXCHANGE_RESULTS
                                                       || test->state == DISPLAY_RESULTS))) {
        iperf_time_now(&now);
        if (iperf_time_diff(&now, &last_receive_time, &diff_time) == 0) {
            t_usecs = iperf_time_in_usecs(&diff_time);
            if (t_usecs > end_rcv_timeout) {
                /* Idle timeout if no new blocks received */
                if (test->blocks_received == last_receive_blocks) {
                    i_errno = IENOMSG;
                    goto cleanup_and_fail;
                }
            }
        }
    }

@davidBar-On (Contributor Author) replied:

Only the rcv_timeout_us value is used for the timeout (in the select statement). The value of rcv_timeout_value_in_us is only used to initialize rcv_timeout_us and to indicate, for the ending states, whether a receive timeout was requested (as there is no separate "timeout requested" setting). Therefore, the current code is correct in that respect.

I agree that rcv_timeout_value_in_us could probably have a better name, like rcv_timeout_setting_us, but I won't make a change just for that.
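
Put differently, the split of roles described above could be sketched roughly as follows (simplified, with hypothetical helper names and types; not the actual iperf_run_client() code):

    #include <stdbool.h>
    #include <stdint.h>

    #define SEC_TO_US 1000000LL

    /* Simplified stand-in for the --rcv-timeout setting (illustrative). */
    struct rcv_timeout_setting { int64_t secs; int64_t usecs; };

    /* rcv_timeout_value_in_us: the fixed --rcv-timeout setting.  A value > 0
     * also serves as "the user requested a receive timeout at all". */
    static int64_t
    rcv_timeout_setting_us(const struct rcv_timeout_setting *s)
    {
        return s->secs * SEC_TO_US + s->usecs;
    }

    /* rcv_timeout_us: the timeout actually passed to select().  It takes the
     * setting's value when the client is receiving during TEST_RUNNING or,
     * with this PR, also for a sending client in the ending states. */
    static int64_t
    select_timeout_us(int64_t setting_us, bool sender, bool in_ending_state)
    {
        if (setting_us <= 0)
            return -1;      /* no receive timeout requested: select() gets NULL */
        if (!sender || in_ending_state)
            return setting_us;
        return -1;          /* sender while TEST_RUNNING: no receive timeout */
    }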
