
[Bug] Unable to connect to any locator of scouted peer #30

Open
josborja7castillo opened this issue Dec 14, 2023 · 10 comments
Labels
bug Something isn't working

Comments


josborja7castillo commented Dec 14, 2023

Describe the bug

When I try to make three nodes communicate (1 x86 PC + 2 ARM64 boards), the message "unable to connect to any locator of scouted peer" appears on my PC, showing the IP of one of the ARM boards. After this, no data is exchanged between the two affected endpoints.
Each machine uses CycloneDDS with the ROS_LOCALHOST_ONLY=1 environment variable set.
Moreover, sudo ip link set lo multicast on is run before launching the bridge.

In my case, it is important to keep "peer-to-peer" topology instead of router-client.

I am not really sure whether this is caused by an incorrect configuration or an actual bug. Any feedback would be greatly appreciated.

To reproduce

  1. sudo ip link set lo multicast on on every machine.
  2. Execute zenoh_bridge_ros2dds -i "n1" -c zenoh_config.json5 on ARM machine 1.
  3. Execute zenoh_bridge_ros2dds -i "n2" -c zenoh_config.json5 on ARM machine 2.
  4. Execute zenoh_bridge_ros2dds -i "pc" -c zenoh_config.json5 on x86 PC.
    My JSON5 configuration is attached (extension changed due to GitHub's file-extension restrictions)
    zenoh_config.json
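For context, a peer-mode zenoh JSON5 configuration along these lines would be the typical starting point for this setup (field names are taken from zenoh's default config schema; the actual attached file may differ):

```json5
{
  mode: "peer",          // keep peer-to-peer topology, no zenoh router
  scouting: {
    multicast: {
      enabled: true,     // discover peers via UDP multicast scouting
    },
  },
}
```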

System info

PC
Platform: Ubuntu 22.04 with kernel 6.0.2.37
ROS version: Humble with ros-humble-cyclonedds 0.10.3-1jammy.20231117.175619 and ros-humble-rmw-cyclonedds-cpp 1.3.4-1jammy.20231117.183821
Bridge version: main branch according to commit: 83ba7e4

ARM64
Platform: Ubuntu with kernel 5.15.0
ROS version: Humble with ros-humble-cyclonedds 0.10.3-1jammy.20231117.170100 and ros-humble-rmw-cyclonedds-cpp 1.3.4-1jammy.20231118.090403
Bridge version: main branch according to commit: 83ba7e4

josborja7castillo added the bug label on Dec 14, 2023

gabrik commented Dec 15, 2023

Hi @josborja7castillo, your configuration seems good to me.
Can you try to run the basic Python examples: https://github.com/eclipse-zenoh/zenoh-python/tree/master/examples with the same configuration and RUST_LOG=debug enabled?

So execute:

  • RUST_LOG=debug python3 z_sub.py -c zenoh_config.json5 on ARM machine 1
  • RUST_LOG=debug python3 z_sub.py -c zenoh_config.json5 on ARM machine 2
  • RUST_LOG=debug python3 z_pub.py -c zenoh_config.json5 on x86 PC

and then share the log?

@josborja7castillo (Author)

Hi @gabrik, thank you for your fast reply.

I am attaching the logs on the pubs, subs and the log on ARM machines 1 & 2.
log_bridge_n1.txt
log_bridge_n2.txt
log_pub_pc.txt
log_sub_n1.txt
log_sub_n2.txt

If you need further explanation about the interfaces and addresses used, I will be glad to provide it.
Greetings.


gabrik commented Dec 15, 2023

If you could provide that, it would be great. I see too many addresses, and it would be good to understand how they are related.

@josborja7castillo (Author)

Surely, my setup is as follows:

| Machine | Interface 1 | Interface 2 |
|---------|-------------|-------------|
| ARM 1 | 192.168.1.102 (Wireless) | 192.168.2.2 (Ethernet) |
| ARM 2 | 192.168.1.103 (Wireless) | 192.168.2.3 (Ethernet) |
| PC | 192.168.1.133 (Ethernet 1) | 192.168.2.2 (Ethernet 2) |

The idea behind the two interfaces per machine is to control the ARM machines over the Ethernet interface while allowing no other traffic on it. For that reason, iptables blocks all traffic except SSH on the Ethernet interface.
Those rules do not apply to the wireless interface, which is unconstrained.
The remaining addresses are the IPv6 addresses of each interface plus some virtual interfaces that are not in use right now but were set up during a Docker installation (I have used Docker for other purposes).
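For illustration, the described policy corresponds to iptables rules along these lines (the interface name eth0 and the exact rule set are assumptions, not taken from the actual setup):

```shell
# Allow SSH and established connections on the Ethernet interface only;
# drop everything else arriving on it. The wireless interface gets no rules.
iptables -A INPUT -i eth0 -p tcp --dport 22 -j ACCEPT
iptables -A INPUT -i eth0 -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
iptables -A INPUT -i eth0 -j DROP
```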

Thank you for your feedback @gabrik.


gabrik commented Dec 15, 2023

So, if I understood correctly, you would like to have Zenoh communication only on the 192.168.1.x interfaces.
I would suggest configuring a listener on each machine on the specific address it should communicate on, thus avoiding advertising interfaces that are not supposed to be used.

Could you please try this, even with the simple pub/sub examples and share the logs of the results?
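Following this suggestion, the listener can be pinned to the 192.168.1.x address either on the command line with -l tcp/192.168.1.x:7447 or in the JSON5 configuration (the concrete address below is per-machine; field names are assumed from zenoh's default config schema):

```json5
{
  listen: {
    // Advertise only the 192.168.1.x address, so peers never try the
    // firewalled 192.168.2.x endpoints:
    endpoints: ["tcp/192.168.1.102:7447"],
  },
}
```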

@josborja7castillo (Author)

You are right: by invoking each bridge with -l tcp/192.168.1.x:7447, the subscribers now seem to be getting the published data, so that is an improvement :). Nevertheless, the message still appears, and I guess that the "transient state" where the first 8 messages are not received should not be happening. I double-checked this using Wireshark and, in fact, they are not being sent.

Just to avoid asking too many questions, could you please point me to more in-depth information on the connection process, the effects of reliability settings, and so on?

log_bridge_n1_after_l.txt
log_bridge_n2_after_l.txt
log_bridge_pc_after_l.txt
log_pub_after_l.txt
log_sub_n1_after_l.txt
log_sub_n2_after_l.txt

I attach the log files in case it helps.
Thank you again for your kindness.


gabrik commented Dec 15, 2023

> You are right, by invoking each bridge with -l tcp/192.168.1.x:7447 it seems that the subscribers are getting the published data, good that is an improvement :)
Glad that helped.
> Nevertheless, the message still appears and I guess that the "transient state" where the first 8 messages are not received should not be happening. I double-checked this using Wireshark and, in fact, they are not being sent.

I guess this is normal, as Zenoh publishers do not wait for subscribers before sending messages, nor do they cache already-sent messages so that late subscribers can retrieve them.

With the default configuration, I guess what you are trying to get is TRANSIENT_LOCAL behavior, which can be achieved with PublicationCache + QueryingSubscriber: https://github.com/eclipse-zenoh/zenoh/tree/master/zenoh-ext/examples

That said, I'm not sure how the plugin should be configured to enable this behavior.
Let me add my colleague @JEnoch to the discussion; he knows more about this matter than I do.
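To make the distinction concrete, here is a stdlib-only Python sketch of the idea behind PublicationCache + QueryingSubscriber; the class and method names are illustrative, not zenoh's actual API:

```python
class CachingPublisher:
    """Conceptual stand-in for zenoh-ext's PublicationCache: keep the
    last `history` samples so late joiners can query them. Not zenoh API."""

    def __init__(self, history=8):
        self.history = history
        self.cache = []          # last `history` published samples
        self.subscribers = []    # callbacks of live subscribers

    def publish(self, sample):
        # Live subscribers receive the sample immediately.
        for cb in self.subscribers:
            cb(sample)
        # A plain (VOLATILE-like) publisher would stop here;
        # keeping the tail of samples is what enables replay.
        self.cache = (self.cache + [sample])[-self.history:]

    def querying_subscribe(self, cb):
        # A "querying subscriber" first fetches the cached history,
        # then receives live samples: the TRANSIENT_LOCAL-like behavior.
        for sample in self.cache:
            cb(sample)
        self.subscribers.append(cb)

pub = CachingPublisher(history=8)
for i in range(10):
    pub.publish(i)           # published before any subscriber exists

received = []
pub.querying_subscribe(received.append)  # late joiner
print(received)              # prints the last 8 samples: [2, 3, 4, 5, 6, 7, 8, 9]
```

With a plain publisher the late joiner would have received nothing at all, which matches the "first messages are lost" behavior observed above.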

> Just to avoid asking too many questions, could you please guide me to more in-depth information where I could find the connection process, the effects of reliability settings, and so on?

Session establishment is defined here: https://github.com/eclipse-zenoh/zenoh/tree/master/io/zenoh-transport/src/unicast/establishment

TL;DR:
Zenoh first discovers peers via scouting; once the connections are up, it exchanges information about subscriptions and builds the routing table from that.
Thus, since the system is decentralized, there is no instant at which all entities are guaranteed to be discovered and all subscriptions propagated.
That's why seeing some messages dropped is "normal"; both PublicationCache and QueryingSubscriber can alleviate this issue.


JEnoch commented Dec 20, 2023

Hi @josborja7castillo ,

Gabrik is right: the z_pub and z_sub examples are equivalent to DDS pub/sub with VOLATILE as durability QoS. There is no re-publication of historical data to late-joiner subscribers.
The equivalent to DDS pub/sub with TRANSIENT_LOCAL as durability QoS are the z_pub_cache and z_query_sub examples that are available in Rust or C.

But now that your connectivity issue is solved, you can try with ROS 2 Nodes using TRANSIENT_LOCAL QoS.
Please confirm if it works.
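As a sketch of that test on the ROS 2 side (node name, topic, and message type are placeholders; requires a sourced ROS 2 Humble environment, so it is not runnable standalone):

```python
import rclpy
from rclpy.node import Node
from rclpy.qos import QoSProfile, ReliabilityPolicy, DurabilityPolicy
from std_msgs.msg import String

rclpy.init()
node = Node('transient_local_pub')

# TRANSIENT_LOCAL durability: the last `depth` samples are re-delivered
# to late-joining subscribers, unlike the default VOLATILE behavior.
qos = QoSProfile(
    depth=10,
    reliability=ReliabilityPolicy.RELIABLE,
    durability=DurabilityPolicy.TRANSIENT_LOCAL,
)
pub = node.create_publisher(String, 'chatter', qos)
```

The subscriber must also request TRANSIENT_LOCAL to receive the history; a VOLATILE subscriber would still connect but would only get live samples.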

@josborja7castillo (Author)

Hi @JEnoch and @gabrik , thank you for your feedback.

I will try your suggestions as soon as possible. Sadly, my office is going to be closed for the next two weeks.

Cheers.

@imstevenpmwork

Hello @josborja7castillo!
Did you manage to try the suggestions from above? Let us know :)
