
[Bug] Unable to connect to any locator of scouted peer #30

Open
josborja7castillo opened this issue Dec 14, 2023 · 10 comments
Labels
bug Something isn't working

Comments


josborja7castillo commented Dec 14, 2023

Describe the bug

When I try to make three nodes communicate (1 x86 PC + 2 ARM64 boards), the message "unable to connect to any locator of scouted peer" appears on my PC, showing the IP of one of the ARM boards. After this, no data is exchanged between the two affected endpoints.
Each machine uses CycloneDDS with the ROS_LOCALHOST_ONLY=1 environment variable set.
Moreover, sudo ip link set lo multicast on is run before launching the bridge.

In my case, it is important to keep "peer-to-peer" topology instead of router-client.

I am not really sure whether this is caused by an incorrect configuration or an actual bug. Any feedback would be greatly appreciated.

To reproduce

  1. sudo ip link set lo multicast on on every machine.
  2. Execute zenoh_bridge_ros2dds -i "n1" -c zenoh_config.json5 on ARM machine 1.
  3. Execute zenoh_bridge_ros2dds -i "n2" -c zenoh_config.json5 on ARM machine 2.
  4. Execute zenoh_bridge_ros2dds -i "pc" -c zenoh_config.json5 on x86 PC.
    My JSON5 configuration is attached (extension changed due to GitHub's file-extension restrictions)
    zenoh_config.json
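For context, a peer-mode zenoh JSON5 configuration along these lines would be the typical starting point for this setup (field names are taken from zenoh's default config schema; the actual attached file may differ):

```json5
{
  mode: "peer",          // keep peer-to-peer topology, no zenoh router
  scouting: {
    multicast: {
      enabled: true,     // discover peers via UDP multicast scouting
    },
  },
}
```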

System info

PC
Platform: Ubuntu 22.04 with kernel 6.0.2.37
ROS version: Humble with ros-humble-cyclonedds 0.10.3-1jammy.20231117.175619 and ros-humble-rmw-cyclonedds-cpp 1.3.4-1jammy.20231117.183821
Bridge version: main branch according to commit: 83ba7e4

ARM64
Platform: Ubuntu with kernel 5.15.0
ROS version: Humble with ros-humble-cyclonedds 0.10.3-1jammy.20231117.170100 and ros-humble-rmw-cyclonedds-cpp 1.3.4-1jammy.20231118.090403
Bridge version: main branch according to commit: 83ba7e4

josborja7castillo added the bug label on Dec 14, 2023

gabrik commented Dec 15, 2023

Hi @josborja7castillo, your configuration seems good to me.
Can you try to run the basic Python examples: https://github.com/eclipse-zenoh/zenoh-python/tree/master/examples with the same configuration and RUST_LOG=debug enabled?

So execute:

  • RUST_LOG=debug python3 z_sub.py -c zenoh_config.json5 on ARM machine 1
  • RUST_LOG=debug python3 z_sub.py -c zenoh_config.json5 on ARM machine 2
  • RUST_LOG=debug python3 z_pub.py -c zenoh_config.json5 on x86 PC

and then share the log?

@josborja7castillo (Author)

Hi @gabrik, thank you for your fast reply.

I am attaching the logs on the pubs, subs and the log on ARM machines 1 & 2.
log_bridge_n1.txt
log_bridge_n2.txt
log_pub_pc.txt
log_sub_n1.txt
log_sub_n2.txt

If you need further explanation about the interfaces and addresses used, I will be glad to provide it.
Greetings.


gabrik commented Dec 15, 2023

If you could provide that, it would be great. I see too many addresses, and it would be good to understand how they are related.

@josborja7castillo (Author)

Surely, my setup is as follows:

| Machine | Interface 1 | Interface 2 |
|---------|-------------|-------------|
| ARM 1 | 192.168.1.102 (Wireless) | 192.168.2.2 (Ethernet) |
| ARM 2 | 192.168.1.103 (Wireless) | 192.168.2.3 (Ethernet) |
| PC | 192.168.1.133 (Ethernet 1) | 192.168.2.2 (Ethernet 2) |

The idea behind the two interfaces per machine is to control the ARM machines over the Ethernet interface while allowing no other traffic on it. For that reason, iptables blocks all traffic except SSH on the Ethernet interface.
Those rules do not apply to the wireless interface, which is unconstrained.
The remaining addresses are the IPv6 addresses of each interface plus some virtual interfaces that are not in use right now but were set up during a Docker installation (I have used Docker for other purposes).
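For illustration, the described policy corresponds to iptables rules along these lines (the interface name eth0 and the exact rule set are assumptions, not taken from the actual setup):

```shell
# Allow SSH and established connections on the Ethernet interface only;
# drop everything else arriving on it. The wireless interface gets no rules.
iptables -A INPUT -i eth0 -p tcp --dport 22 -j ACCEPT
iptables -A INPUT -i eth0 -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
iptables -A INPUT -i eth0 -j DROP
```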

Thank you for your feedback @gabrik.


gabrik commented Dec 15, 2023

So, if I understood correctly, you would like to have Zenoh communication only on the 192.168.1.x interfaces.
I would suggest configuring a listener on each machine on the specific address it should communicate on, thus avoiding advertising interfaces that are not supposed to be used.

Could you please try this, even with the simple pub/sub examples and share the logs of the results?
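Following this suggestion, the listener can be pinned to the 192.168.1.x address either on the command line with -l tcp/192.168.1.x:7447 or in the JSON5 configuration (the concrete address below is per-machine; field names are assumed from zenoh's default config schema):

```json5
{
  listen: {
    // Advertise only the 192.168.1.x address, so peers never try the
    // firewalled 192.168.2.x endpoints:
    endpoints: ["tcp/192.168.1.102:7447"],
  },
}
```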

@josborja7castillo (Author)

You are right: by invoking each bridge with -l tcp/192.168.1.x:7447, the subscribers now seem to be getting the published data, so that is an improvement :). Nevertheless, the message still appears, and I guess that the "transient state" where the first 8 messages are not received should not be happening. I double-checked this using Wireshark and, in fact, they are not being sent.

Just to avoid asking too many questions, could you please point me to more in-depth information on the connection process, the effects of reliability settings, and so on?

log_bridge_n1_after_l.txt
log_bridge_n2_after_l.txt
log_bridge_pc_after_l.txt
log_pub_after_l.txt
log_sub_n1_after_l.txt
log_sub_n2_after_l.txt

I attach the log files in case it helps.
Thank you again for your kindness.


gabrik commented Dec 15, 2023

> You are right, by invoking each bridge with -l tcp/192.168.1.x:7447 it seems that the subscribers are getting the published data, good that is an improvement :)
Glad that helped.
> Nevertheless, the message still appears and I guess that the "transient state" where the first 8 messages are not received should not be happening. I double-checked this using Wireshark and, in fact, they are not being sent.

I guess this is normal, as Zenoh publishers do not wait for subscribers before sending messages, nor do they cache already-sent messages so that late subscribers can retrieve them.

With the default configuration, I guess what you are trying to get is TRANSIENT_LOCAL behavior, which can be achieved with PublicationCache + QueryingSubscriber: https://github.com/eclipse-zenoh/zenoh/tree/master/zenoh-ext/examples

That said, I'm not sure how the plugin should be configured to enable this behavior.
Let me add my colleague @JEnoch to the discussion; he knows more about this matter than I do.
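To make the distinction concrete, here is a stdlib-only Python sketch of the idea behind PublicationCache + QueryingSubscriber; the class and method names are illustrative, not zenoh's actual API:

```python
class CachingPublisher:
    """Conceptual stand-in for zenoh-ext's PublicationCache: keep the
    last `history` samples so late joiners can query them. Not zenoh API."""

    def __init__(self, history=8):
        self.history = history
        self.cache = []          # last `history` published samples
        self.subscribers = []    # callbacks of live subscribers

    def publish(self, sample):
        # Live subscribers receive the sample immediately.
        for cb in self.subscribers:
            cb(sample)
        # A plain (VOLATILE-like) publisher would stop here;
        # keeping the tail of samples is what enables replay.
        self.cache = (self.cache + [sample])[-self.history:]

    def querying_subscribe(self, cb):
        # A "querying subscriber" first fetches the cached history,
        # then receives live samples: the TRANSIENT_LOCAL-like behavior.
        for sample in self.cache:
            cb(sample)
        self.subscribers.append(cb)

pub = CachingPublisher(history=8)
for i in range(10):
    pub.publish(i)           # published before any subscriber exists

received = []
pub.querying_subscribe(received.append)  # late joiner
print(received)              # prints the last 8 samples: [2, 3, 4, 5, 6, 7, 8, 9]
```

With a plain publisher the late joiner would have received nothing at all, which matches the "first messages are lost" behavior observed above.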

> Just to avoid asking too many questions, could you please guide me to more in-depth information where I could find the connection process, the effects of reliability settings, and so on?

Session establishment is defined here: https://github.com/eclipse-zenoh/zenoh/tree/master/io/zenoh-transport/src/unicast/establishment

TL;DR:
Zenoh first discovers peers via scouting; once the connections are up, it exchanges information about subscriptions and builds the routing table from that.
Thus, since the system is decentralized, there is no instant at which all entities are guaranteed to be discovered and all subscriptions propagated.
That's why seeing some messages dropped is "normal"; both PublicationCache and QueryingSubscriber can alleviate this issue.


JEnoch commented Dec 20, 2023

Hi @josborja7castillo ,

Gabrik is right: the z_pub and z_sub examples are equivalent to DDS pub/sub with VOLATILE as durability QoS. There is no re-publication of historical data to late-joiner subscribers.
The equivalent to DDS pub/sub with TRANSIENT_LOCAL as durability QoS are the z_pub_cache and z_query_sub examples that are available in Rust or C.

But now that your connectivity issue is solved, you can try with ROS 2 Nodes using TRANSIENT_LOCAL QoS.
Please confirm if it works.
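As a sketch of that test on the ROS 2 side (node name, topic, and message type are placeholders; requires a sourced ROS 2 Humble environment, so it is not runnable standalone):

```python
import rclpy
from rclpy.node import Node
from rclpy.qos import QoSProfile, ReliabilityPolicy, DurabilityPolicy
from std_msgs.msg import String

rclpy.init()
node = Node('transient_local_pub')

# TRANSIENT_LOCAL durability: the last `depth` samples are re-delivered
# to late-joining subscribers, unlike the default VOLATILE behavior.
qos = QoSProfile(
    depth=10,
    reliability=ReliabilityPolicy.RELIABLE,
    durability=DurabilityPolicy.TRANSIENT_LOCAL,
)
pub = node.create_publisher(String, 'chatter', qos)
```

The subscriber must also request TRANSIENT_LOCAL to receive the history; a VOLATILE subscriber would still connect but would only get live samples.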

@josborja7castillo (Author)

Hi @JEnoch and @gabrik , thank you for your feedback.

I will try your suggestions as soon as possible. Sadly, my office is going to be closed for the next two weeks.

Cheers.

@imstevenpmwork

Hello @josborja7castillo!
Did you manage to try the suggestions from above? Let us know :)
