[Bug] Remote bridge constantly retires writers on UDP #37
Comments
I can reproduce a similar issue with ROS 2 video streaming over WiFi via UDP:
As soon as
Sometimes:
It seems that some UDP frames are lost or arrive malformed, which is usual over WiFi (there are collisions, and UDP is not reliable). Still, this causes Zenoh to close the connection. Moreover, the remote bridge seems to be unaware of this closure (close message lost?), so the reconnection is refused.
@Mallets: I'm not sure this is the behaviour we want for Zenoh over UDP for non-reliable publications.
@ciandonovan: with significant traffic over WiFi there is always some UDP frame loss. Zenoh doesn't yet implement a reliability protocol over the UDP transport, meaning even DDS RELIABLE topics won't actually be reliable when routed by the
@JEnoch: thanks for that insight, will experiment with QUIC - TLS/mTLS is a requirement for that though right? Currently not using it with TCP as it's already wrapped in a Wireguard VPN.
This sounds ideal. I don't personally need reliability over Zenoh, even for DDS RELIABLE topics, as that reliability is set for intra-robot communication; the Zenoh bridge is for real-time remote monitoring, where latency is more important. The reason I was experimenting with UDP is that I discovered Zenoh through this blog post, https://zenoh.io/blog/2021-09-28-iac-experiences-from-the-trenches/, and UDP was used there. Maybe the CISCO Ultra-Reliable Wireless Backhaul (CURWB) is good enough compared to WiFi that this issue doesn't arise?
Does QUIC solve the head-of-line issue with TCP for Zenoh here too? As in, a larger, slower topic being retransmitted won't hold up other higher-frequency, low-bandwidth topics, as they'd be on separate streams? I've found anecdotally that the robot is much less responsive to /joy commands (a couple of kilobytes) when run alongside a couple of megabytes of /image topics, despite there being significant bandwidth remaining. Naturally there will always be a decrease, but I'm wondering if it's exacerbated by TCP compared with QUIC?
Unfortunately, yes: TLS is required by QUIC. But you could just use the same self-signed certificate for all.
Possibly. But I also think that in the case of the Indy Autonomous Challenge, they don't have data large enough to need fragmentation when routed over Zenoh. The problem I see (the connection closing and being unable to reconnect) is tied to fragmentation, specifically to missing fragments.
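To illustrate why large, fragmented messages are so much more exposed to frame loss than small ones, here is a minimal sketch. It assumes that frame losses are independent and that a message is unusable if any one of its fragments is missing. The function names, the 1400-byte fragment payload, and the 1% loss rate are all hypothetical values chosen for illustration, not Zenoh's actual parameters or behaviour.

```python
def fragment_count(message_bytes: int, fragment_payload_bytes: int) -> int:
    """Number of fragments needed to carry one message (ceiling division)."""
    if message_bytes <= 0 or fragment_payload_bytes <= 0:
        raise ValueError("sizes must be positive")
    return -(-message_bytes // fragment_payload_bytes)


def whole_message_probability(
    message_bytes: int, fragment_payload_bytes: int, frame_loss_rate: float
) -> float:
    """Probability that every fragment of one message arrives.

    Assumes each frame is lost independently with probability frame_loss_rate.
    """
    if not 0.0 <= frame_loss_rate <= 1.0:
        raise ValueError("frame_loss_rate must be between 0 and 1")
    fragments = fragment_count(message_bytes, fragment_payload_bytes)
    return (1.0 - frame_loss_rate) ** fragments


if __name__ == "__main__":
    # Hypothetical numbers: 1400-byte fragment payload, 1% frame loss.
    for label, size in [("small command", 1_000), ("large image", 1_000_000)]:
        n = fragment_count(size, 1400)
        p = whole_message_probability(size, 1400, 0.01)
        print(f"{label}: {n} fragment(s), P(all arrive) = {p:.6f}")
```

Under these assumptions, a message that fits in a single frame is only as lossy as the link itself, while a message spread over hundreds of fragments almost never arrives complete, which would be consistent with the problem showing up for image streaming but not for small payloads.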
Probably not yet. As I understand it, QUIC improves HOL issues if several streams are used within the same QUIC connection. HOL blocking can still occur within a stream, but that won't affect the other streams. That would indeed be a nice evolution to implement.
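The difference between one ordered stream and several independent streams can be sketched with a toy model. This is not Zenoh or QUIC code: `Packet`, `delivery_times`, the latency, and the retransmit delay are hypothetical names and values used only to show the effect. Each packet is delivered no earlier than the previous packet on the same stream, and a packet that is lost once arrives late by a fixed retransmit delay.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Packet:
    topic: str
    send_time: float         # seconds
    lost_once: bool = False  # True if the first transmission is lost


def delivery_times(
    packets: List[Packet],
    latency: float = 0.01,
    retransmit_delay: float = 0.2,
    per_topic_streams: bool = False,
) -> List[float]:
    """Delivery time of each packet, in input (send) order.

    With per_topic_streams=False, all packets share one ordered stream,
    so a late packet delays everything sent after it.
    With per_topic_streams=True, ordering is only enforced within a topic.
    """
    last_delivery: Dict[str, float] = {}
    result = []
    for p in packets:
        stream = p.topic if per_topic_streams else "shared"
        arrival = p.send_time + latency + (retransmit_delay if p.lost_once else 0.0)
        # In-order delivery: wait for the previous packet on this stream.
        delivered = max(arrival, last_delivery.get(stream, 0.0))
        last_delivery[stream] = delivered
        result.append(delivered)
    return result


if __name__ == "__main__":
    traffic = [
        Packet("/image", 0.00, lost_once=True),
        Packet("/joy", 0.01),
        Packet("/joy", 0.02),
    ]
    print("single stream:    ", delivery_times(traffic))
    print("stream per topic: ", delivery_times(traffic, per_topic_streams=True))
```

In the single-stream case, both /joy packets wait behind the retransmitted /image packet; with one stream per topic, they are delivered after only the base latency.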
Describe the bug
No issues with TCP, but with the exact same configuration with UDP I get about a second or two of streaming, followed by a barrage of messages saying "Remote bridge {GUID} retires {Publisher/Service/Action/etc.}" and then "Route Publisher (ROS:/{TOPIC} -> Zenoh:{TOPIC}) removed"
Connectivity isn't an issue, since after just replacing udp with tcp in the command argument, everything works fine.
Using CycloneDDS configured on the localhost only, and loopback multicast force-enabled.
For extra context, running around 60 nodes with 130 topics on a single PC, a lot from the Nav2 stack. WiFi bandwidth at least 150 Mbit/s. When streaming over TCP, around 80 Mbit/s down. Running in Podman OCI containers for convenience, but previously reproduced outside of containers too. Devices both on the same LAN.
To reproduce
-l udp/0.0.0.0:7447
-e udp/{BRIDGE_IP}:7447
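For switching between the failing UDP setup and the working TCP setup, the two arguments above can be generated from a single protocol value. This is a small sketch: `bridge_endpoint_args` is a hypothetical helper, and the IP address in the example is a placeholder standing in for {BRIDGE_IP}. Only the `-l` and `-e` flags and the `protocol/host:port` endpoint form are taken from the report.

```python
from typing import Dict, List


def bridge_endpoint_args(
    protocol: str, bridge_ip: str, port: int = 7447
) -> Dict[str, List[str]]:
    """Build the listen (-l) and connect (-e) arguments for the two bridges."""
    if protocol not in {"udp", "tcp"}:
        raise ValueError("this sketch only covers the udp and tcp cases")
    return {
        # Bridge that listens on all interfaces.
        "listener": ["-l", f"{protocol}/0.0.0.0:{port}"],
        # Bridge that connects to the listening bridge.
        "connector": ["-e", f"{protocol}/{bridge_ip}:{port}"],
    }


if __name__ == "__main__":
    # 192.0.2.10 is a documentation-only placeholder address.
    for proto in ("udp", "tcp"):
        print(proto, bridge_endpoint_args(proto, "192.0.2.10"))
```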
System info