no DDS Writer after 3s - drop incoming data (broker topology - client mode) #151

Open
JEnoch opened this issue Oct 6, 2023 · 3 comments

Comments

@JEnoch
Member

JEnoch commented Oct 6, 2023

Discussed in eclipse-zenoh/roadmap#94

Originally posted by gtoff October 4, 2023
Hi,

we are running a brokered topology with one router in a K8S cluster, one remote robot, and one or more K8S pod clients (running rviz and zenoh-dds-bridge) in the same cluster.
We noticed something we did not expect. If the pods running rviz connect in "client" mode (-m client), we only seem to be able to connect one client at a time. The others get the following warning repeatedly for each topic:

WARN zenoh_plugin_dds::route_zenoh_dds] Route Zenoh->DDS (rt/tf -> rt/tf): still no DDS Writer after 3s - drop incoming data!

If we switch to peer mode (-m peer), which we don't want because we'd like all communication to go through the router with no peer discovery, then messages come through.

Any idea what this could be?
I am not even sure I understand the warning: is the bridge complaining that it cannot create a local DDS writer?

@JEnoch JEnoch transferred this issue from eclipse-zenoh/roadmap Oct 6, 2023
@JEnoch
Member Author

JEnoch commented Oct 6, 2023

Hi @gtoff,

I transferred your question as an issue here in eclipse-zenoh/zenoh-plugin-dds, since it's related to this bridge.

The "still no DDS Writer after 3s" message occurs when a bridge running in forwarding mode has discovered a DDS Reader.
In this mode, the bridge prepares a route for that Reader: it creates a Zenoh subscriber and forwards the discovery info to the remote bridges (so they can declare a DDS Reader that will be discovered by remote ROS Nodes).
But it doesn't yet create the DDS Writer that will route data coming via Zenoh to this discovered DDS Reader. Only when a remote bridge forwards the discovery information of a DDS Writer is the route completed, by creating this DDS Writer with the same QoS as the DDS Writer announced by the remote bridge.

What happens in your case is that a local DDS Reader has been discovered, but no remote bridge has forwarded the discovery info of a DDS Writer. Still, some data is received via Zenoh; it is kept on hold for 3 seconds (waiting for a discovery info message, in case of order inversion) and is eventually dropped.
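
To make the hold-then-drop behaviour described above concrete, here is a minimal toy model in Python. It is not the plugin's actual code (the bridge is written in Rust), and the class name, method names and the `now` parameter are all hypothetical. It also assumes that samples still on hold are delivered once the DDS Writer is created, which is how I read "waiting for a discovery info message, in case of order inversion".

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, List, Optional, Tuple

# Taken from the warning text: "still no DDS Writer after 3s".
HOLD_TIMEOUT_S = 3.0


@dataclass
class ZenohToDdsRoute:
    """Toy model of a Zenoh->DDS route (hypothetical, not the plugin's code)."""

    topic: str
    # Stays None until a remote bridge announces a DDS Writer.
    writer_qos: Optional[str] = None
    delivered: List[bytes] = field(default_factory=list)
    dropped: List[bytes] = field(default_factory=list)
    _pending: Deque[Tuple[float, bytes]] = field(default_factory=deque)

    def on_zenoh_sample(self, payload: bytes, now: float) -> None:
        """Data arrives via Zenoh: deliver it, or hold it if there is no Writer yet."""
        self._expire(now)
        if self.writer_qos is not None:
            self.delivered.append(payload)
        else:
            self._pending.append((now, payload))

    def on_remote_writer_discovered(self, qos: str, now: float) -> None:
        """Discovery info arrives: complete the route with the announced QoS."""
        self.writer_qos = qos
        self._expire(now)
        # Assumption: samples still within the hold window are flushed.
        while self._pending:
            _, payload = self._pending.popleft()
            self.delivered.append(payload)

    def _expire(self, now: float) -> None:
        """Drop samples held for HOLD_TIMEOUT_S or longer (the warning case)."""
        while self._pending and now - self._pending[0][0] >= HOLD_TIMEOUT_S:
            _, payload = self._pending.popleft()
            self.dropped.append(payload)


route = ZenohToDdsRoute("rt/tf")
route.on_zenoh_sample(b"a", now=0.0)   # held: no DDS Writer yet
route.on_zenoh_sample(b"b", now=2.0)   # held as well
route.on_remote_writer_discovered("reliable", now=3.5)
# b"a" was held for 3.5 s and is dropped; b"b" was held for 1.5 s and is delivered.
```

In this model, a bridge that never receives the Writer's discovery info keeps dropping every sample once it is 3 seconds old, which matches the repeated warning in the report.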

Are all your bridges configured with the -f (or --fwd-discovery) option?

@gtoff

gtoff commented Oct 6, 2023

Thank you @JEnoch,

so the no DDS Writer warning must be unrelated to the issue.
Indeed, we run the bridge with the -f option because we still want to be able to build a complete ROS graph with rqt (for teaching purposes).

To give more context, we are currently just running rqt / rviz in the k8s pods and all pods have the same hostname and will start ROS nodes with the same name. Could this be the reason why messages don't go through?
We also don't see this happening with lightweight applications, but once we start with more heavyweight topics (e.g., images) only one of the clients gets the data. We are talking about peaks of 15MBps, so I think we're far from saturating the infrastructure...

@JEnoch
Member Author

JEnoch commented Oct 9, 2023

Another user also reported some issues within a K8S environment on our Discord.
I reproduced his deployment and saw strange behaviour in his Gazebo pod: the bridge was discovering DDS entities only after a few seconds, while in other pods discovery took on the order of milliseconds.
This made me think that something goes wrong in the network traffic. Possibly some congestion with messages being delayed in some queue or buffer.

As far as I understand, the K8S network is virtualized. This can cause behaviour different from that of an Ethernet or loopback network. I'm not sure how to investigate this, but I will try to after ROSCon.
