[Bug] QoS incompatibility issue New publisher discovered on topic '/bot/firmware/drive_state/front', offering incompatible QoS. No messages will be sent to it. Last incompatible policy: RELIABILITY_QOS_POLICY #154

Open
nazmicancalik opened this issue Oct 30, 2023 · 6 comments
Labels: bug (Something isn't working)

Describe the bug

Zenoh peers have a QoS incompatibility issue when there are two subscribers with different ROS 2 QoS settings. Consider the following situation:

In the ROS 2 code there is one RELIABLE Publisher, plus one RELIABLE and one BEST_EFFORT Subscriber, all on the same topic. To our understanding, sometimes (when Zenoh discovers the BEST_EFFORT Subscriber's DDS entity first, rather than the RELIABLE one) it gives the following error:

New publisher discovered on topic '/bot/firmware/drive_state/front', offering incompatible QoS. No messages will be sent to it. Last incompatible policy: RELIABILITY_QOS_POLICY

In the Zenoh source we found a TODO saying that, when a new discovery is received, the existing QoS should be checked so that an incompatible one is not created.
Do you have any idea whether this is related, or why this problem might occur?

To reproduce

  1. Start a subscriber with BEST_EFFORT QoS to topic a.
  2. Start a subscriber with RELIABLE QoS to topic a.
  3. Start a publisher with RELIABLE QoS to topic a.
  4. Error occurs.

System info

Platform: Ubuntu 22.04.2 LTS, 64-bit
Processor: AMD Ryzen 9 5900HX with Radeon Graphics × 16
Zenoh-bridge-dds version: zenoh bridge for DDS v0.7.2-rc built with rustc 1.70.0 (90c541806 2023-05-31)

nazmicancalik added the bug label on Oct 30, 2023
nazmicancalik (Author) commented Oct 30, 2023

To recreate the issue, please run the following script:

    #!/usr/bin/env python3

    import rclpy
    import rclpy.node
    import rclpy.qos

    from std_msgs.msg import String


    class ReliablerSubscriber(rclpy.node.Node):
        def __init__(self):
            super().__init__('reliable_subscriber')

            # qos_profile_sensor_data requests BEST_EFFORT reliability
            self.sub1 = self.create_subscription(
                String, 'topic', self.callback,
                qos_profile=rclpy.qos.qos_profile_sensor_data)

            # qos_profile_parameters requests RELIABLE reliability
            self.sub2 = self.create_subscription(
                String, 'topic', self.callback,
                qos_profile=rclpy.qos.qos_profile_parameters)

        def callback(self, msg: String):
            self.get_logger().info('I heard: "%s"' % msg.data)


    def main(argv=None):
        rclpy.init(args=argv)
        node = ReliablerSubscriber()
        rclpy.spin(node)
        rclpy.shutdown()


    if __name__ == '__main__':
        main()
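
A RELIABLE publisher is also needed to trigger the error. The following companion sketch is not part of the original report (the node name, topic name and publish rate are illustrative); it makes the RELIABLE reliability explicit via a QoSProfile rather than relying on rclpy defaults:

    #!/usr/bin/env python3

    # Illustrative companion publisher (not from the original report):
    # offers RELIABLE reliability on the same 'topic' as the subscribers above.

    import rclpy
    import rclpy.node
    from rclpy.qos import QoSProfile, ReliabilityPolicy

    from std_msgs.msg import String


    class ReliablePublisher(rclpy.node.Node):
        def __init__(self):
            super().__init__('reliable_publisher')
            qos = QoSProfile(depth=10, reliability=ReliabilityPolicy.RELIABLE)
            self.pub = self.create_publisher(String, 'topic', qos)
            # Publish one message per second.
            self.timer = self.create_timer(1.0, self.publish)

        def publish(self):
            msg = String()
            msg.data = 'hello'
            self.pub.publish(msg)


    def main(argv=None):
        rclpy.init(args=argv)
        node = ReliablePublisher()
        rclpy.spin(node)
        rclpy.shutdown()


    if __name__ == '__main__':
        main()

Per the report above, the warning appears when the bridge happens to discover the BEST_EFFORT subscription first.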

JEnoch (Member) commented Nov 2, 2023

Hi @nazmicancalik,

You correctly analyzed the problem:
for efficiency and simplicity reasons, the bridge creates only 1 DDS Reader or Writer per Topic. The QoS used for those creations depends on the first DDS entity discovered on the Topic (via local discovery by default, or via remote discovery by another bridge in Forward Discovery mode).
Thus, it might happen that the bridge creates a Reader or a Writer with QoS that are not suitable for a Reader/Writer discovered afterward.

A solution could be the one proposed by @Fellfalla in #155 (thanks for this!): delete the Reader or Writer, and re-create it with "upgraded" QoS.

But I see several drawbacks to this approach:

  1. The Reliability QoS is no longer guaranteed, since some publications could be missed in the interval between the deletion and the re-creation of the DDS entity.
  2. In the case of a TRANSIENT_LOCAL topic, this could lead a Subscriber to receive duplicates of historical data, since on Reader re-creation the historical data will be received and routed again.
  3. It could cause a working configuration to stop working just by declaring a new Subscriber. Imagine the following scenario:
    • Start a publisher with BEST_EFFORT QoS to topic a.
    • Start a subscriber with BEST_EFFORT QoS to topic a.
      => compatible QoS, communication occurs via the bridge, all is fine
    • Start a subscriber with RELIABLE QoS to topic a.
      => the bridge on the publisher side upgrades the QoS of its DDS Reader to RELIABLE, which is incompatible with the BEST_EFFORT publisher. Thus communication stops.

The other solution I'm considering is to add the possibility to configure, in the bridge's JSON5 config file, the QoS to be used for certain topics. Those configured QoS would override the discovered ones.
Thus, when you know that a mix of QoS settings is used by different Publishers/Subscribers on some topics in your system, you can force the bridge to use the configured QoS.
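
To make the proposal concrete, here is a sketch of what such an override might look like in the bridge's JSON5 config. The key names below are purely hypothetical (the actual schema would be defined by the feature itself); only the plugins/dds nesting follows the bridge's usual config layout:

    {
      plugins: {
        dds: {
          // Hypothetical per-topic QoS override (illustrative only):
          // force the bridge's Reader/Writer QoS for matching topics,
          // instead of copying the QoS of the first discovered DDS entity.
          qos_overrides: [
            { topic: "/bot/firmware/drive_state/front", reliability: "RELIABLE" },
          ],
        },
      },
    }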

Do you think this is a suitable solution for your use case?

Fellfalla commented:

Hi @JEnoch, thank you for your feedback and thoughts 🙂!

To answer your general question: overriding QoS from the config is definitely a good and clean solution for us. However, it requires more thorough attention to all existing and new subscriber QoS settings, and is thus a bit less convenient. So, a few thoughts on your valid points:

  1. Is the reliability at the time a reliable subscriber is created actually relevant? It seems it is only possible to drop messages if an unreliable QoS exists and a new reliable QoS then pops up, so only messages for the best-effort subscriber might be dropped, or am I mistaken?
  2. Good point. This actually reminds me of something that already seems to be happening in our stack; I'm not sure if it is actually related. I can try to recreate an example setup.
  3. At least on the ROS 2 side, this is already an invalid system configuration, which breaks communication for the reliable subscriber. If we force RELIABLE via the config, the communication wouldn't work either.

Looking forward to hearing back from you!

JEnoch (Member) commented Dec 1, 2023

Is the reliability at the time a reliable subscriber is created actually relevant? It seems it is only possible to drop messages if an unreliable QoS exists and a new reliable QoS then pops up, so only messages for the best-effort subscriber might be dropped, or am I mistaken?

That's correct.

At least on the ROS 2 side, this is already an invalid system configuration, which breaks communication for the reliable subscriber. If we force RELIABLE via the config, the communication wouldn't work either.

Right, but in a pure ROS 2 system the best-effort Publisher to best-effort Subscriber communication will still occur,
while with the bridges, if we upgrade the bridge's Subscriber to RELIABLE, ALL communication on this topic will stop.

JEnoch (Member) commented Dec 1, 2023

After reflection, I think that, regarding the RELIABILITY QoS, it would make sense for the bridge to always create a RELIABLE Writer, whatever the discovered QoS. It then always matches any future Reader, whatever its reliability QoS.
And according to the Cyclone DDS team, this won't make a significant difference in the bridge's resource consumption.
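
As a side note (an editorial illustration, not from the thread; node and topic names are made up), the DDS matching rule this relies on can be checked with plain rclpy: a publisher offering RELIABLE is matched both by a subscription requesting RELIABLE and by one requesting only BEST_EFFORT.

    #!/usr/bin/env python3

    # Illustration of the DDS reliability matching rule: an offered RELIABLE
    # QoS satisfies both RELIABLE and BEST_EFFORT requests, so both
    # subscriptions below receive the published messages.

    import rclpy
    import rclpy.node
    from rclpy.qos import QoSProfile, ReliabilityPolicy

    from std_msgs.msg import String


    class ReliabilityMatchingDemo(rclpy.node.Node):
        def __init__(self):
            super().__init__('reliability_matching_demo')
            reliable = QoSProfile(depth=10, reliability=ReliabilityPolicy.RELIABLE)
            best_effort = QoSProfile(depth=10, reliability=ReliabilityPolicy.BEST_EFFORT)

            # Publisher offering RELIABLE reliability.
            self.pub = self.create_publisher(String, 'demo_topic', reliable)

            # Subscription requesting RELIABLE: compatible.
            self.sub_reliable = self.create_subscription(
                String, 'demo_topic',
                lambda m: self.get_logger().info('reliable sub: "%s"' % m.data),
                reliable)

            # Subscription requesting BEST_EFFORT: also compatible.
            self.sub_best_effort = self.create_subscription(
                String, 'demo_topic',
                lambda m: self.get_logger().info('best-effort sub: "%s"' % m.data),
                best_effort)

            self.timer = self.create_timer(1.0, self.publish)

        def publish(self):
            msg = String()
            msg.data = 'hello'
            self.pub.publish(msg)


    def main():
        rclpy.init()
        rclpy.spin(ReliabilityMatchingDemo())
        rclpy.shutdown()


    if __name__ == '__main__':
        main()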

I created #165 and eclipse-zenoh/zenoh-plugin-ros2dds#23 to track this, and that will fix this issue.

For other QoS I still think "on-the-fly" adaptation via deletion/re-creation of the Writer might cause trouble, starting with the loss of messages, for instance in the case where a RELIABLE+VOLATILE Writer has to be upgraded to RELIABLE+TRANSIENT_LOCAL.

For this, the possibility for the user to configure the QoS used by the bridge per topic (or per group of topics via a regex) is a better solution, as it is more flexible and allows fine-tuning for purposes other than QoS matching issues.
I created #166 and eclipse-zenoh/zenoh-plugin-ros2dds#24 to track this feature.

JEnoch (Member) commented Dec 11, 2023

@Fellfalla, now that #165 and eclipse-zenoh/zenoh-plugin-ros2dds#23 have been fixed, could you please check if that solves your issue (with either zenoh-bridge-dds or zenoh-bridge-ros2dds)?
