
[Bug] Remote bridge constantly retires writers on UDP #37

Open
ciandonovan opened this issue Dec 20, 2023 · 4 comments
Labels
bug Something isn't working

Comments

@ciandonovan

Describe the bug

No issues with TCP, but with the exact same configuration over UDP I get about a second or two of streaming, followed by a barrage of messages saying "Remote bridge {GUID} retires {Publisher/Service/Action/etc.}" and then "Route Publisher (ROS:/{TOPIC} -> Zenoh:{TOPIC}) removed".

Connectivity isn't an issue, since simply replacing udp with tcp in the command argument makes everything work fine.

Using CycloneDDS configured to use the localhost interface only, with multicast force-enabled on loopback.
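For reference, a localhost-only CycloneDDS setup like the one described can be sketched as follows (exact element names vary between CycloneDDS versions, and `lo` is an assumed interface name; check the CycloneDDS configuration guide for your version):

```xml
<CycloneDDS xmlns="https://cdds.io/config">
  <Domain Id="any">
    <General>
      <!-- Bind DDS traffic to the loopback interface only -->
      <Interfaces>
        <NetworkInterface name="lo" multicast="true"/>
      </Interfaces>
      <!-- Force-enable multicast even though loopback normally reports none -->
      <AllowMulticast>true</AllowMulticast>
    </General>
  </Domain>
</CycloneDDS>
```

Point `CYCLONEDDS_URI` at this file (e.g. `export CYCLONEDDS_URI=file:///path/to/cyclonedds.xml`) for every ROS node and for the bridge.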

For extra context, running around 60 nodes with 130 topics on a single PC, a lot from the Nav2 stack. WiFi bandwidth at least 150 Mbit/s. When streaming over TCP, around 80 Mbit/s down. Running in Podman OCI containers for convenience, but previously reproduced outside of containers too. Devices both on the same LAN.

To reproduce

  1. Start Zenoh bridge with config -l udp/0.0.0.0:7447
  2. Run client/peer with config -e udp/{BRIDGE_IP}:7447

System info

  • Platform: Official Open Source Robotics Foundation Docker image (Humble) running on Debian 12 host.
  • CPU: AMD Ryzen Embedded V2718 with Radeon Graphics
  • Zenoh version/commit: 2c52d0b
ciandonovan added the bug label Dec 20, 2023
@JEnoch
Member

JEnoch commented Jan 4, 2024

I reproduced a similar issue with ROS 2 video streaming over WiFi via UDP:

  • laptop 1:
    • ros2 run v4l2_camera v4l2_camera_node
    • RUST_LOG=zenoh_transport=debug zenoh-bridge-ros2dds -l udp/0.0.0.0:7447
  • laptop 2:
    • RUST_LOG=zenoh_transport=debug zenoh-bridge-ros2dds -l udp/0.0.0.0:7447
    • ros2 run rqt_image_view rqt_image_view

As soon as rqt_image_view subscribes to /image_raw such logs appear for bridge on laptop 1:

[2024-01-04T16:51:09Z DEBUG zenoh_transport::unicast::universal::link] Expected SN 147887862, received 147887863 at /Users/julienenoch/.cargo/git/checkouts/zenoh-cc237f2570fab813/780ec60/io/zenoh-transport/src/common/defragmentation.rs:68.
[2024-01-04T16:51:09Z DEBUG zenoh_transport::unicast::universal::transport] [d7eaa1bb7fd74a8ec8680d91acc8865c] Closing transport with peer: 68a0035cb1f9064dc7e7fac3d134876
[2024-01-04T16:51:09Z DEBUG zenoh_transport::unicast::establishment::open] Received a close message (reason MAX_LINKS) in response to an OpenSyn on: TransportLinkUnicast { link: Link { src: udp/192.200.40.10:51329, dst: udp/192.200.40.18:7447, mtu: 9216, is_reliable: false, is_streamed: false }, config: TransportLinkUnicastConfig { direction: Outbound, batch: BatchConfig { mtu: 9216, is_streamed: false, is_compression: false } } } at /Users/julienenoch/.cargo/git/checkouts/zenoh-cc237f2570fab813/780ec60/io/zenoh-transport/src/unicast/establishment/open.rs:444.
[2024-01-04T16:51:10Z DEBUG zenoh_transport::unicast::establishment::open] Received a close message (reason MAX_LINKS) in response to an OpenSyn on: TransportLinkUnicast { link: Link { src: udp/192.200.40.10:52615, dst: udp/192.200.40.18:7447, mtu: 9216, is_reliable: false, is_streamed: false }, config: TransportLinkUnicastConfig { direction: Outbound, batch: BatchConfig { mtu: 9216, is_streamed: false, is_compression: false } } } at /Users/julienenoch/.cargo/git/checkouts/zenoh-cc237f2570fab813/780ec60/io/zenoh-transport/src/unicast/establishment/open.rs:444.

Sometimes:

[2024-01-04T16:28:43Z DEBUG zenoh_transport::unicast::universal::link] Transport: 9f4fd272bb565b5d9834e8b4342cdc3e. Defragmentation error. at /Users/julienenoch/.cargo/git/checkouts/zenoh-cc237f2570fab813/780ec60/io/zenoh-transport/src/unicast/universal/rx.rs:153.
[2024-01-04T16:28:43Z DEBUG zenoh_transport::unicast::universal::transport] [e903da926a8f398346b8d7e56dd2ef83] Closing transport with peer: 9f4fd272bb565b5d9834e8b4342cdc3e
[2024-01-04T16:28:43Z DEBUG zenoh_transport::unicast::establishment::open] Received a close message (reason MAX_LINKS) in response to an OpenSyn on: TransportLinkUnicast { link: Link { src: udp/192.200.40.10:56373, dst: udp/192.200.40.18:7447, mtu: 9216, is_reliable: false, is_streamed: false }, config: TransportLinkUnicastConfig { direction: Outbound, batch: BatchConfig { mtu: 9216, is_streamed: false, is_compression: false } } } at /Users/julienenoch/.cargo/git/checkouts/zenoh-cc237f2570fab813/780ec60/io/zenoh-transport/src/unicast/establishment/open.rs:444.
[2024-01-04T16:28:44Z DEBUG zenoh_transport::unicast::establishment::open] Received a close message (reason MAX_LINKS) in response to an OpenSyn on: TransportLinkUnicast { link: Link { src: udp/192.200.40.10:55246, dst: udp/192.200.40.18:7447, mtu: 9216, is_reliable: false, is_streamed: false }, config: TransportLinkUnicastConfig { direction: Outbound, batch: BatchConfig { mtu: 9216, is_streamed: false, is_compression: false } } } at /Users/julienenoch/.cargo/git/checkouts/zenoh-cc237f2570fab813/780ec60/io/zenoh-transport/src/unicast/establishment/open.rs:444.

It seems that some UDP frames are lost or received malformed, which is usual over WiFi (there are collisions, and UDP is not reliable). Still, this makes Zenoh close the connection. Moreover, the remote bridge seems not to be aware of this closure (close message lost?), and the reconnection is refused.

@Mallets : I'm not sure this is the behaviour we want for Zenoh over UDP for non-reliable publications.
I think a loss of fragments should not lead to disconnection, but just to dropping the message, right?

@ciandonovan : with significant traffic over WiFi there is always some UDP frame loss. Zenoh doesn't yet implement a reliability protocol over its UDP transport, meaning even DDS RELIABLE topics won't actually be reliable when routed by the zenoh-bridge-ros2dds over UDP.
If you need reliability, you should use TCP or QUIC instead of UDP for the time being.
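The drop-instead-of-close behaviour proposed above can be sketched with a toy defragmenter (illustrative Python, not Zenoh's actual code; sequence-number handling is deliberately simplified):

```python
# Toy defragmenter: on a sequence-number gap, drop only the in-progress
# message and keep the transport alive, instead of closing the connection.
class Defragmenter:
    def __init__(self):
        self.expected_sn = None   # next fragment SN we expect, or None
        self.buffer = []          # fragments of the message being rebuilt
        self.delivered = []       # fully reassembled messages
        self.dropped = 0          # count of messages dropped due to gaps

    def on_fragment(self, sn, payload, is_last):
        if self.expected_sn is not None and sn != self.expected_sn:
            # Gap detected: discard the current message only.
            self.buffer.clear()
            self.dropped += 1
            self.expected_sn = None
            return
        self.buffer.append(payload)
        self.expected_sn = sn + 1
        if is_last:
            self.delivered.append(b"".join(self.buffer))
            self.buffer.clear()
            self.expected_sn = None
```

With this policy a lost fragment costs one message, while the transport (and all other routes over it) keeps running.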

@ciandonovan
Author

@JEnoch: thanks for that insight, I'll experiment with QUIC. TLS/mTLS is a requirement for that though, right? I'm currently not using TLS with TCP, as the traffic is already wrapped in a WireGuard VPN.

@Mallets : I'm not sure this is the behaviour we want for Zenoh over UDP for non-reliable publications.
I think a loss of fragments should not lead to disconnection, but just to dropping the message, right?

This sounds ideal. I personally don't need reliability over Zenoh, even for DDS RELIABLE topics: that reliability is set for intra-robot communication, while the Zenoh bridge is used for real-time remote monitoring, where latency matters more.

The reason I was experimenting with UDP was that I discovered Zenoh through this blog, https://zenoh.io/blog/2021-09-28-iac-experiences-from-the-trenches/, where UDP was used. Maybe Cisco Ultra-Reliable Wireless Backhaul (CURWB) is good enough compared to WiFi that this issue doesn't arise?

@ciandonovan
Author

Does QUIC solve the head-of-line blocking issue of TCP for Zenoh here too? As in, a larger, slower topic being retransmitted won't hold up other higher-frequency, low-bandwidth topics, since they'd be on separate streams?

I've found anecdotally that the robot is much less responsive to /joy commands (a couple of kilobytes) when run alongside a couple of megabytes of /image topics, despite significant bandwidth remaining. Naturally there will always be some decrease, but I'm wondering if it's exacerbated by TCP compared to QUIC?

@JEnoch
Member

JEnoch commented Jan 4, 2024

will experiment with QUIC. TLS/mTLS is a requirement for that though, right?

Unfortunately, yes: TLS is required by QUIC. But you could just use the same self-signed certificate for all.
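As a sketch, a Zenoh QUIC listener with a shared self-signed certificate might be configured like this (the key names are assumptions based on Zenoh's DEFAULT_CONFIG.json5 and may differ between Zenoh versions; the file paths are placeholders):

```json5
{
  listen: {
    endpoints: ["quic/0.0.0.0:7447"],
  },
  transport: {
    link: {
      tls: {
        // The same self-signed CA/certificate deployed on every host
        root_ca_certificate: "/etc/zenoh/ca.pem",
        server_private_key: "/etc/zenoh/key.pem",
        server_certificate: "/etc/zenoh/cert.pem",
      },
    },
  },
}
```

Passed to the bridge with `zenoh-bridge-ros2dds -c config.json5`; check the DEFAULT_CONFIG.json5 shipped with your Zenoh version for the authoritative key names.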

Maybe the CISCO Ultra-Reliable Wireless Backhaul is good enough compared to WiFi that this issue doesn't arise?

Possibly. But I also think that, in the case of the Indy Autonomous Challenge, they don't route data big enough to need fragmentation over Zenoh. The problem I see (connection closed and unable to reconnect) is tied to fragmentation, when fragments go missing.

Does QUIC solve the head-of-line issue with TCP for Zenoh here too?

Probably not yet. As I understand it, QUIC mitigates HOL blocking when several streams are used within the same QUIC connection: HOL blocking can still occur within a stream, but it won't affect the other streams.
Zenoh uses only 1 bi-directional stream so far. We would need to use several, and bind them to priority levels (binding per key expression is not an option, since it would likely hit some maximum number of streams).
Then the bridge would need to map the DDS Priority QoS to a Zenoh priority.
Finally, you would need to make sure your ROS nodes use different Priority QoS for the relevant topics.

That would indeed be a nice evolution to implement.
I suggest you first run some tests with the current QUIC implementation, and let us know if you still see HOL blocking. Then we'll consider adding it to the short-term roadmap.
