You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
When running mavros via UDP and a network/radio disconnection event occurs, an intended retry in the libmavconn udp.cpp file can cause unintended, potentially dangerous behaviour.
Overview:
In the libmavconn/src/udp.cppsend_message() function, MAVROS is using boost's IO Service (specifically uses post() on lines 249 & 275) to make an async call of do_sendto(). The do_sento() function uses async_send_to() from the boost ip::udp::socket class to read from a message queue and send the data to an endpoint. During a prolonged network disconnection, the system can crash.
Specifics
When there is a network/radio disconnection during message transmission, async_send_to() returns an asio::error::network_unreachable error. This error is caught on line 324 where the issue is logged. For this particular type of error, the function does not return. Rather, the function continues to run and there is a comment in the code indicating that this is intended:
if (error == asio::error::network_unreachable) {
CONSOLE_BRIDGE_logWarn(PFXd "sendto: %s, retrying", sthis->conn_id, error.message().c_str());
// do not return, try to resend
} else if (error) {
CONSOLE_BRIDGE_logError(PFXd "sendto: %s", sthis->conn_id, error.message().c_str());
sthis->close();
return;
}
Since the message queue is not empty, execution continues. Since the network is disconnected, no bytes are transferred and the position in the message buffer is not modified. Execution continues to line 348, in which do_sendto() is recursively called. If the network does not reestablish, this causes a loop in which async_send_to() continues to error out with asio::error::network_unreachable and without any data being transmitted. This recalling of do_sendto() will loop continuously and basically hang. Eventually, the recursive mutex will reach the maximum number of locks that can be applied which will cause a crash, or, if we continue to call send_message(), other threads will wait until the recursive mutex is unlocked which will never happen. This will cause all threads to hang until eventually we run out of available threads.
If there is any misunderstanding in the behaviour of this function on my end, please let me know.
Questions
Is this an intentional design choice for Mavros?
Is there a better/safer way to handle this particular issue when the radio fails without continually waiting and crashing?
MAVROS version and platform
Mavros: 1.17.0
ROS: Noetic
Ubuntu: 22.04
The text was updated successfully, but these errors were encountered:
Issue details
When running mavros via UDP and a network/radio disconnection event occurs, an intended retry in the libmavconn udp.cpp file can cause unintended, potentially dangerous behaviour.
Overview:
In the
libmavconn/src/udp.cpp
send_message()
function, MAVROS is using boost's IO Service (specifically uses post() on lines 249 & 275) to make an async call ofdo_sendto()
. The do_sento() function usesasync_send_to()
from theboost ip::udp::socket
class to read from a message queue and send the data to an endpoint. During a prolonged network disconnection, the system can crash.Specifics
When there is a network/radio disconnection during message transmission,
async_send_to()
returns anasio::error::network_unreachable
error. This error is caught on line 324 where the issue is logged. For this particular type of error, the function does not return. Rather, the function continues to run and there is a comment in the code indicating that this is intended:Since the message queue is not empty, execution continues. Since the network is disconnected, no bytes are transferred and the position in the message buffer is not modified. Execution continues to line 348, in which
do_sendto()
is recursively called. If the network does not reestablish, this causes a loop in whichasync_send_to()
continues to error out withasio::error::network_unreachable
and without any data being transmitted. This recalling ofdo_sendto()
will loop continuously and basically hang. Eventually, the recursive mutex will reach the maximum number of locks that can be applied which will cause a crash, or, if we continue to callsend_message()
, other threads will wait until the recursive mutex is unlocked which will never happen. This will cause all threads to hang until eventually we run out of available threads.If there is any misunderstanding in the behaviour of this function on my end, please let me know.
Questions
Is this an intentional design choice for Mavros?
Is there a better/safer way to handle this particular issue when the radio fails without continually waiting and crashing?
MAVROS version and platform
Mavros: 1.17.0
ROS: Noetic
Ubuntu: 22.04
The text was updated successfully, but these errors were encountered: