
[Bug] Remote bridge constantly retires writers on UDP #37

Open
ciandonovan opened this issue Dec 20, 2023 · 4 comments
Labels
bug Something isn't working

Comments

@ciandonovan

Describe the bug

No issues with TCP, but with the exact same configuration over UDP I get about a second or two of streaming, followed by a barrage of messages saying "Remote bridge {GUID} retires {Publisher/Service/Action/etc.}" and then "Route Publisher (ROS:/{TOPIC} -> Zenoh:{TOPIC}) removed".

Connectivity isn't an issue, since simply replacing udp with tcp in the command argument makes everything work fine.

Using CycloneDDS configured to use the localhost interface only, with multicast force-enabled on loopback.
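For reference, a localhost-only CycloneDDS setup like the one described can be sketched as follows (exact element names vary between CycloneDDS versions, and `lo` is an assumed interface name; check the CycloneDDS configuration guide for your version):

```xml
<CycloneDDS xmlns="https://cdds.io/config">
  <Domain Id="any">
    <General>
      <!-- Bind DDS traffic to the loopback interface only -->
      <Interfaces>
        <NetworkInterface name="lo" multicast="true"/>
      </Interfaces>
      <!-- Force-enable multicast even though loopback normally reports none -->
      <AllowMulticast>true</AllowMulticast>
    </General>
  </Domain>
</CycloneDDS>
```

Point `CYCLONEDDS_URI` at this file (e.g. `export CYCLONEDDS_URI=file:///path/to/cyclonedds.xml`) for every ROS node and for the bridge.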

For extra context, running around 60 nodes with 130 topics on a single PC, a lot from the Nav2 stack. WiFi bandwidth at least 150 Mbit/s. When streaming over TCP, around 80 Mbit/s down. Running in Podman OCI containers for convenience, but previously reproduced outside of containers too. Devices both on the same LAN.

To reproduce

  1. Start Zenoh bridge with config -l udp/0.0.0.0:7447
  2. Run client/peer with config -e udp/{BRIDGE_IP}:7447

System info

  • Platform: Official Open Source Robotics Foundation Docker image (Humble) running on Debian 12 host.
  • CPU: AMD Ryzen Embedded V2718 with Radeon Graphics
  • Zenoh version/commit: 2c52d0b
ciandonovan added the bug label Dec 20, 2023
@JEnoch
Member

JEnoch commented Jan 4, 2024

I reproduced a similar issue with ROS 2 video streaming over WiFi via UDP:

  • laptop 1:
    • ros2 run v4l2_camera v4l2_camera_node
    • RUST_LOG=zenoh_transport=debug zenoh-bridge-ros2dds -l udp/0.0.0.0:7447
  • laptop 2:
    • RUST_LOG=zenoh_transport=debug zenoh-bridge-ros2dds -l udp/0.0.0.0:7447
    • ros2 run rqt_image_view rqt_image_view

As soon as rqt_image_view subscribes to /image_raw such logs appear for bridge on laptop 1:

[2024-01-04T16:51:09Z DEBUG zenoh_transport::unicast::universal::link] Expected SN 147887862, received 147887863 at /Users/julienenoch/.cargo/git/checkouts/zenoh-cc237f2570fab813/780ec60/io/zenoh-transport/src/common/defragmentation.rs:68.
[2024-01-04T16:51:09Z DEBUG zenoh_transport::unicast::universal::transport] [d7eaa1bb7fd74a8ec8680d91acc8865c] Closing transport with peer: 68a0035cb1f9064dc7e7fac3d134876
[2024-01-04T16:51:09Z DEBUG zenoh_transport::unicast::establishment::open] Received a close message (reason MAX_LINKS) in response to an OpenSyn on: TransportLinkUnicast { link: Link { src: udp/192.200.40.10:51329, dst: udp/192.200.40.18:7447, mtu: 9216, is_reliable: false, is_streamed: false }, config: TransportLinkUnicastConfig { direction: Outbound, batch: BatchConfig { mtu: 9216, is_streamed: false, is_compression: false } } } at /Users/julienenoch/.cargo/git/checkouts/zenoh-cc237f2570fab813/780ec60/io/zenoh-transport/src/unicast/establishment/open.rs:444.
[2024-01-04T16:51:10Z DEBUG zenoh_transport::unicast::establishment::open] Received a close message (reason MAX_LINKS) in response to an OpenSyn on: TransportLinkUnicast { link: Link { src: udp/192.200.40.10:52615, dst: udp/192.200.40.18:7447, mtu: 9216, is_reliable: false, is_streamed: false }, config: TransportLinkUnicastConfig { direction: Outbound, batch: BatchConfig { mtu: 9216, is_streamed: false, is_compression: false } } } at /Users/julienenoch/.cargo/git/checkouts/zenoh-cc237f2570fab813/780ec60/io/zenoh-transport/src/unicast/establishment/open.rs:444.

Sometimes:

[2024-01-04T16:28:43Z DEBUG zenoh_transport::unicast::universal::link] Transport: 9f4fd272bb565b5d9834e8b4342cdc3e. Defragmentation error. at /Users/julienenoch/.cargo/git/checkouts/zenoh-cc237f2570fab813/780ec60/io/zenoh-transport/src/unicast/universal/rx.rs:153.
[2024-01-04T16:28:43Z DEBUG zenoh_transport::unicast::universal::transport] [e903da926a8f398346b8d7e56dd2ef83] Closing transport with peer: 9f4fd272bb565b5d9834e8b4342cdc3e
[2024-01-04T16:28:43Z DEBUG zenoh_transport::unicast::establishment::open] Received a close message (reason MAX_LINKS) in response to an OpenSyn on: TransportLinkUnicast { link: Link { src: udp/192.200.40.10:56373, dst: udp/192.200.40.18:7447, mtu: 9216, is_reliable: false, is_streamed: false }, config: TransportLinkUnicastConfig { direction: Outbound, batch: BatchConfig { mtu: 9216, is_streamed: false, is_compression: false } } } at /Users/julienenoch/.cargo/git/checkouts/zenoh-cc237f2570fab813/780ec60/io/zenoh-transport/src/unicast/establishment/open.rs:444.
[2024-01-04T16:28:44Z DEBUG zenoh_transport::unicast::establishment::open] Received a close message (reason MAX_LINKS) in response to an OpenSyn on: TransportLinkUnicast { link: Link { src: udp/192.200.40.10:55246, dst: udp/192.200.40.18:7447, mtu: 9216, is_reliable: false, is_streamed: false }, config: TransportLinkUnicastConfig { direction: Outbound, batch: BatchConfig { mtu: 9216, is_streamed: false, is_compression: false } } } at /Users/julienenoch/.cargo/git/checkouts/zenoh-cc237f2570fab813/780ec60/io/zenoh-transport/src/unicast/establishment/open.rs:444.

It seems that some UDP frames are lost or received malformed, which is usual over WiFi (there are collisions, and UDP is not reliable). Still, this makes Zenoh close the connection. Moreover, the remote bridge seems not to be aware of this closure (close message lost?), and the reconnection is refused.

@Mallets : I'm not sure this is the behaviour we want for Zenoh over UDP for non-reliable publications.
I think a loss of fragments should not lead to disconnection, but just to dropping the message, right?

@ciandonovan : with significant traffic over WiFi there is always some UDP frame loss. Zenoh doesn't yet implement a reliability protocol over its UDP transport, meaning even DDS RELIABLE topics won't actually be reliable when routed by the zenoh-bridge-ros2dds over UDP.
If you need reliability, you should use TCP or QUIC instead of UDP for the time being.
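The drop-instead-of-close behaviour proposed above can be sketched with a toy defragmenter (illustrative Python, not Zenoh's actual code; sequence-number handling is deliberately simplified):

```python
# Toy defragmenter: on a sequence-number gap, drop only the in-progress
# message and keep the transport alive, instead of closing the connection.
class Defragmenter:
    def __init__(self):
        self.expected_sn = None   # next fragment SN we expect, or None
        self.buffer = []          # fragments of the message being rebuilt
        self.delivered = []       # fully reassembled messages
        self.dropped = 0          # count of messages dropped due to gaps

    def on_fragment(self, sn, payload, is_last):
        if self.expected_sn is not None and sn != self.expected_sn:
            # Gap detected: discard the current message only.
            self.buffer.clear()
            self.dropped += 1
            self.expected_sn = None
            return
        self.buffer.append(payload)
        self.expected_sn = sn + 1
        if is_last:
            self.delivered.append(b"".join(self.buffer))
            self.buffer.clear()
            self.expected_sn = None
```

With this policy a lost fragment costs one message, while the transport (and all other routes over it) keeps running.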

@ciandonovan
Author

@JEnoch: thanks for that insight, I'll experiment with QUIC. TLS/mTLS is a requirement for that though, right? I'm currently not using TLS with TCP, as the traffic is already wrapped in a WireGuard VPN.

@Mallets : I'm not sure this is the behaviour we want for Zenoh over UDP for non-reliable publications.
I think a loss of fragments should not lead to disconnection, but just to dropping the message, right?

This sounds ideal. I personally don't need reliability over Zenoh, even for DDS RELIABLE topics: that reliability is set for intra-robot communication, while the Zenoh bridge is used for real-time remote monitoring, where latency matters more.

The reason I was experimenting with UDP was that I discovered Zenoh through this blog, https://zenoh.io/blog/2021-09-28-iac-experiences-from-the-trenches/, where UDP was used. Maybe Cisco Ultra-Reliable Wireless Backhaul (CURWB) is good enough compared to WiFi that this issue doesn't arise?

@ciandonovan
Author

Does QUIC solve the head-of-line blocking issue of TCP for Zenoh here too? As in, a larger, slower topic being retransmitted won't hold up other higher-frequency, low-bandwidth topics, since they'd be on separate streams?

I've found anecdotally that the robot is much less responsive to /joy commands (a couple of kilobytes) when run alongside a couple of megabytes of /image topics, despite significant bandwidth remaining. Naturally there will always be some decrease, but I'm wondering if it's exacerbated by TCP compared to QUIC?

@JEnoch
Member

JEnoch commented Jan 4, 2024

will experiment with QUIC. TLS/mTLS is a requirement for that though, right?

Unfortunately, yes: TLS is required by QUIC. But you could just use the same self-signed certificate for all.
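As a sketch, a Zenoh QUIC listener with a shared self-signed certificate might be configured like this (the key names are assumptions based on Zenoh's DEFAULT_CONFIG.json5 and may differ between Zenoh versions; the file paths are placeholders):

```json5
{
  listen: {
    endpoints: ["quic/0.0.0.0:7447"],
  },
  transport: {
    link: {
      tls: {
        // The same self-signed CA/certificate deployed on every host
        root_ca_certificate: "/etc/zenoh/ca.pem",
        server_private_key: "/etc/zenoh/key.pem",
        server_certificate: "/etc/zenoh/cert.pem",
      },
    },
  },
}
```

Passed to the bridge with `zenoh-bridge-ros2dds -c config.json5`; check the DEFAULT_CONFIG.json5 shipped with your Zenoh version for the authoritative key names.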

Maybe the CISCO Ultra-Reliable Wireless Backhaul is good enough compared to WiFi that this issue doesn't arise?

Possibly. But I also think that, in the case of the Indy Autonomous Challenge, they don't route data big enough to need fragmentation over Zenoh. The problem I see (connection closed and unable to reconnect) is tied to fragmentation, when fragments go missing.

Does QUIC solve the head-of-line issue with TCP for Zenoh here too?

Probably not yet. As I understand it, QUIC mitigates HOL blocking when several streams are used within the same QUIC connection: HOL blocking can still occur within a stream, but it won't affect the other streams.
Zenoh uses only 1 bi-directional stream so far. We would need to use several, and bind them to priority levels (binding per key expression is not an option, since it would likely hit some maximum number of streams).
Then the bridge would need to map the DDS Priority QoS to a Zenoh priority.
Finally, you would need to make sure your ROS nodes use different Priority QoS for the relevant topics.

That would indeed be a nice evolution to implement.
I suggest you first run some tests with the current QUIC implementation, and let us know if you still see HOL blocking. Then we'll consider adding it to the short-term roadmap.
