Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Persistent network disconnection causes a crash #1918

Open
JTomasi opened this issue Oct 31, 2023 · 0 comments
Open

Persistent network disconnection causes a crash #1918

JTomasi opened this issue Oct 31, 2023 · 0 comments

Comments

@JTomasi
Copy link

JTomasi commented Oct 31, 2023

Issue details

When running mavros via UDP and a network/radio disconnection event occurs, an intended retry in the libmavconn udp.cpp file can cause unintended, potentially dangerous behaviour.

Overview:

In the libmavconn/src/udp.cpp send_message() function, MAVROS is using boost's IO Service (specifically uses post() on lines 249 & 275) to make an async call of do_sendto(). The do_sento() function uses async_send_to() from the boost ip::udp::socket class to read from a message queue and send the data to an endpoint. During a prolonged network disconnection, the system can crash.

Specifics

When there is a network/radio disconnection during message transmission, async_send_to() returns an asio::error::network_unreachable error. This error is caught on line 324 where the issue is logged. For this particular type of error, the function does not return. Rather, the function continues to run and there is a comment in the code indicating that this is intended:

    if (error == asio::error::network_unreachable) {
        CONSOLE_BRIDGE_logWarn(PFXd "sendto: %s, retrying", sthis->conn_id, error.message().c_str());
        // do not return, try to resend
    } else if (error) {
        CONSOLE_BRIDGE_logError(PFXd "sendto: %s", sthis->conn_id, error.message().c_str());
        sthis->close();
        return;
    }

Since the message queue is not empty, execution continues. Since the network is disconnected, no bytes are transferred and the position in the message buffer is not modified. Execution continues to line 348, in which do_sendto() is recursively called. If the network does not reestablish, this causes a loop in which async_send_to() continues to error out with asio::error::network_unreachable and without any data being transmitted. This recalling of do_sendto() will loop continuously and basically hang. Eventually, the recursive mutex will reach the maximum number of locks that can be applied which will cause a crash, or, if we continue to call send_message(), other threads will wait until the recursive mutex is unlocked which will never happen. This will cause all threads to hang until eventually we run out of available threads.

If there is any misunderstanding in the behaviour of this function on my end, please let me know.

Questions

Is this an intentional design choice for Mavros?

Is there a better/safer way to handle this particular issue when the radio fails without continually waiting and crashing?

MAVROS version and platform

Mavros: 1.17.0
ROS: Noetic
Ubuntu: 22.04

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant