
Frequency drops with additional subscriber #461

Open
timonegk opened this issue Jul 4, 2023 · 9 comments

Comments

@timonegk

timonegk commented Jul 4, 2023

Bug report

Required Info:

  • Operating System:
    • Ubuntu 22.04
  • Installation type:
    • Binaries
  • Version or commit hash:
    • 1.6.0
  • DDS implementation:
    • rmw_cyclonedds_cpp

Steps to reproduce issue

  • Publish image messages (in our case 512x384 px) with 10 Hz
  • Subscribe to the topic, receive images with 10 Hz
  • Add a second subscriber, observe frequency drop to 7 Hz

Expected behavior

The frequency should not drop.

Actual behavior

The frequency drops.

@timonegk (Author) commented Jul 4, 2023

@dimaxano

Hi!

I'm facing the same problem on my robot (though it is not reproducible on my dev laptop, even though the software configuration is the same).

What networking interfaces do you have on the machine where the issue happens? Is ROS 2 running behind any of them?

@dimaxano

I'm also able to reproduce the issue with ddsperf.

  1. ddsperf pub 30Hz size 3MB (terminal 1). 3MB is just an arbitrary number; it is reproducible with lower values too, down to 500kB in my case
  2. ddsperf sub (terminal 2)
  3. ddsperf sub (terminal 3), etc

Again, this is only reproducible on the robot, not locally.

Environment: docker, Ubuntu 22.04.2 LTS

sysctl vars: 
    net.core.rmem_max = 212992 
    net.core.rmem_default = 212992 

DDS Implementation: cyclone 

Packages Version: 
ros-humble-cyclonedds/jammy,now 0.9.1-1jammy.20230425.232107 amd64 [installed,automatic] 
ros-humble-rmw-cyclonedds-cpp/jammy,now 1.3.4-1jammy.20230426.033340 amd64 [installed] 

That configuration is the same for the robot computer and the local one. Also, no XML config was used for dds.

@nguyentuanHUST

Does anyone know how to fix this issue? I tried https://docs.ros.org/en/rolling/How-To-Guides/DDS-tuning.html but it did not work.

@eboasson
Collaborator

> I'm also able to reproduce the issue with ddsperf.
>
>   1. ddsperf pub 30Hz size 3MB (terminal 1). 3MB is just an arbitrary number; it is reproducible with lower values too, down to 500kB in my case
>   2. ddsperf sub (terminal 2)
>   3. ddsperf sub (terminal 3), etc.
>
> Again, this is only reproducible on the robot, not locally.

Thank you for using ddsperf! 🥳 It also means I know the QoS settings and so I know it is using reliable communication.

My first assumption would always be what it says in https://docs.ros.org/en/rolling/How-To-Guides/DDS-tuning.html#cyclone-dds-tuning, because Cyclone defaults to requesting a 1MB socket receive buffer and so a 3MB size is not unlikely to overrun it. Of course a 500kB message would easily fit in 1MB of buffer space ...
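For reference, the receive-buffer tuning from that guide has two parts: raising the kernel limit (sysctl) and asking Cyclone DDS for a larger buffer via its XML config. This is a sketch; the element is spelled SocketReceiveBufferSize in recent Cyclone DDS releases, while older releases used MinimumSocketReceiveBufferSize, so check the config reference for your version:

```xml
<!-- Pair this with e.g. "sudo sysctl -w net.core.rmem_max=2147483647" on the
     host, and select the file with CYCLONEDDS_URI=file:///path/to/cyclonedds.xml -->
<CycloneDDS>
  <Domain Id="any">
    <Internal>
      <!-- ask for a 10MB socket receive buffer instead of the 1MB default -->
      <SocketReceiveBufferSize min="10MB"/>
    </Internal>
  </Domain>
</CycloneDDS>
```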

Another thing that comes to mind is the possibility that it is using multicast on your "normal" machine, but unicast on the robot. There are a number of possible reasons for that, one of them being a WiFi interface vs a wired network interface. Without having tried it out, it seems self-evident that sending the data multiple times will slow it down.

If the publisher and the subscriber are both on the same machine, there is also the possibility that for one subscriber the kernel routes it over the loopback interface, but that from the second subscriber onwards it starts using multicast (if it deems the network interface suitable for that) and that this forces the process to slow down to the rate of the selected network interface. At least I suspect so ...

The unicast/multicast decisions are easily seen in a wireshark capture, but at least for me it would be more convenient if you could grab the actual configuration of Cyclone DDS from the system and paste the relevant bits here. That you can do by:

CYCLONEDDS_URI="<Tr><C>config</><Out>stdout</></>" ddsperf pub 30Hz size 3MB

and/or same for the subscriber. You'll see a list of all the settings (but given you're using default settings, that's probably not all that interesting) followed by a bit of information on network interfaces (the interesting part). E.g.:

1690295837.323599 [0]    2247770: started at 1690295837.06322266 -- 2023-07-25 16:37:17+02:00
1690295837.323643 [0]    2247770: udp initialized
1690295837.324441 [0]    2247770: interfaces: lo0 udp/127.0.0.1(q1) en0 wireless udp/192.168.2.11(q9)
1690295837.324453 [0]    2247770: en0: presumed flaky multicast, use for SPDP only
1690295837.324459 [0]    2247770: selected interfaces: en0 (index 12 priority 0 mc {spdp})
1690295837.324549 [0]    2247770: ownip: udp/192.168.2.11
1690295837.324552 [0]    2247770: extmask: invalid/0 (not applicable)
1690295837.324555 [0]    2247770: SPDP MC: udp/239.255.0.1
1690295837.324558 [0]    2247770: default MC: udp/239.255.0.1
1690295837.324560 [0]    2247770: SSM support included
1690295837.325207 [0]    2247770: socket receive buffer size set to 1048576 bytes
...
1690295837.325317 [0]    2247770: socket send buffer size set to 65536 bytes
1690295837.325333 [0]    2247770: interface en0: transmit port 49226

(See the fourth line I quoted: "presumed flaky multicast". This is indeed my laptop's WiFi.) There is a lot more to be found if you put in trace rather than config, but that'll give you so much output that you'll have to write it to disk and stop it before it fills up the disk. It does contain detailed information on a ton of things, including the addressing choices it makes, and so is relevant in principle, but let's not go there just yet.

Secondly, ddsperf has two cryptic options: -d NWIF and -X. The first takes the network interface for which you want the data rate logged. This is tricky: if I run ddsperf pub and ddsperf sub locally on my laptop, Cyclone will (correctly) claim to be using en0, but the unicast traffic all goes over the loopback interface. So it might be fiddly. -X simply gives some numbers from deep inside.
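Putting those options together, the invocations might look like this (the interface name lo0 is from my machine; substitute your own):

```shell
# Terminal 1: publish 3 MB samples at 30 Hz, logging the interface's data
# rate (-d) and some internal counters (-X)
ddsperf -d lo0 -X pub 30Hz size 3MB
# Terminal 2: matching subscriber with the same logging options
ddsperf -d lo0 -X sub
```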

From pub + sub you get something like:

[55974] 3.005    30/s   7m |                              | 100%   8m
[55975] 3.005  size 3145728 total 90 lost 0 delta 29 lost 0 rate 0.03 kS/s 730.17 Mb/s (0.01 kS/s 226.60 Mb/s)
[55974] 3.005  rss:354.7MB vcsw:33 ivcsw:675 pub:11%+12%
[55975] 3.005  rss:343.4MB vcsw:51 ivcsw:47 recvUC:15%+11%
[55974] 3.005  lo0: xmit 761.32 Mb/s recv 761.32 Mb/s [3561167924 3561167924]
[55974] 3.005  discarded 0 rexmit 967680 Trexmit 9719375 Tthrottle 6205039 Nthrottle 18
[55975] 3.005  discarded 0 rexmit 0 Trexmit 0 Tthrottle 0 Nthrottle 0

(This is from a debug build on macOS.)

The lines with "discarded", "rexmit", etc. say something about what it is doing. The CPU loads tell you how much of the CPU it occupies. Together they tell a bit more than the normal output. I'd be curious to see what it says.

I don't know if these extra bits of information will be sufficient for figuring out the cause, but it is easy enough to be worth a try.

@dimaxano

I tried running with extended tracing (<Tr><C>config</><Out>stdout</></>) and you were right!

My laptop's highest-"quality" interface is wireless, so only spdp is enabled. But we have a slightly trickier network setup on the robot (all connections are Ethernet):

router with internet
        \
         switch ──> computer with ROS
        /
STM board with micro-ROS

So the "best" interface on the robot's computer is wired, which means multicast is used at full potential.
But after I set AllowMulticast to spdp on the robot, the large-message problem disappeared 🎉
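For anyone landing here later, the setting lives under General in the Cyclone DDS XML config; a sketch of what this looks like (the file path passed via CYCLONEDDS_URI is up to you):

```xml
<!-- Restrict multicast to participant discovery (SPDP) only; user data goes
     unicast.  Select with e.g. CYCLONEDDS_URI=file:///etc/cyclonedds.xml -->
<CycloneDDS>
  <Domain Id="any">
    <General>
      <AllowMulticast>spdp</AllowMulticast>
    </General>
  </Domain>
</CycloneDDS>
```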

Thank you very much for your help!

The one thing I'm not sure about is how AllowMulticast == spdp influences communication in ROS2.

I started the software stack on the robot and I do not see any difference compared with full multicast, everything is working fine (including the second ROS2 device (stm) in the network). Maybe I'm just not fully understanding the purpose of multicast... What ROS-related features do I lose when using just spdp?

@jaagut

jaagut commented Jul 26, 2023

Thank you @eboasson for your detailed tips.
@timonegk, I or others from @bit-bots will try your suggestions in the next few weeks and let you know the results.

@eboasson
Collaborator

> The one thing I'm not sure about is how AllowMulticast == spdp influences communication in ROS2.

It usually doesn't make a big difference; the spdp bit ensures all discovery still works without having to configure anything and without running into the dreaded #458. (As I noted there, I am working on that; I already know my plan works, but I have to divide my time, so it doesn't move forward as quickly as I'd like.)

What you do get is that in cases where data from one writer needs to go to many readers in different processes (even on the same node), you get many unicast messages instead of one multicast, and there is a cross-over point where multicast really is better despite bothering some processes with data they don't need.
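That cross-over is easy to see with some back-of-the-envelope arithmetic on the 30 Hz / 3 MB ddsperf scenario from this thread (plain arithmetic for illustration, not Cyclone DDS internals):

```python
# With unicast, a writer sends one copy of each sample per remote reader;
# with multicast it sends one copy total, regardless of reader count.

def unicast_mbps(rate_hz: float, size_bytes: int, readers: int) -> float:
    """Wire bandwidth in Mb/s when each reader gets its own copy."""
    return rate_hz * size_bytes * readers * 8 / 1e6

def multicast_mbps(rate_hz: float, size_bytes: int) -> float:
    """Wire bandwidth in Mb/s when one copy serves all readers."""
    return rate_hz * size_bytes * 8 / 1e6

# 30 Hz publisher, 3 MB samples (the ddsperf configuration above)
rate, size = 30, 3 * 1024 * 1024
print(multicast_mbps(rate, size))    # one copy: ~755 Mb/s
print(unicast_mbps(rate, size, 2))   # two unicast readers: ~1510 Mb/s
```

So with only two readers, unicast already doubles the traffic the writer has to push out; on a link that can't absorb that, throttling kicks in and the observed sample rate drops.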

That downside of sending data where it isn't needed can be pretty bad, and in the default configuration it is quite likely to happen for two reasons:

  • it defaults to using the one multicast address defined by the spec, which is also the address used for participant discovery ("SPDP"), so all processes subscribe to it;
  • multicast happens at the IP level, but the port numbers are at the UDP level, so even with switches taking into account the IGMP protocol for maintaining multicast subscriptions on the network, if you are really just using domain 0 but someone else has a domain 1 process on the network, the multicasts would still reach them.

Back in the day, shared use of a network wasn't a thing, at least not in the world this protocol stack originated in. Sending it everywhere usually wasn't an issue in practice, and a bit of configuration tweaking took care of the rest.

There is a solution to these things if you do want to use multicast for some topics. Instead of setting AllowMulticast to spdp, you can also set the default multicast address to 0.0.0.0 (assuming IPv4, and yes, it is a bit ugly to use a magic value like that), and then map some topics to "network partitions" for which you specify the multicast address to use.
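If I understand that suggestion correctly, the combination would look roughly like the sketch below. The partition name, topic pattern, and multicast address are made up for illustration; check the Cyclone DDS configuration reference for the exact element names in your version:

```xml
<CycloneDDS>
  <Domain Id="any">
    <Discovery>
      <!-- 0.0.0.0: no default multicast address for user data -->
      <DefaultMulticastAddress>0.0.0.0</DefaultMulticastAddress>
    </Discovery>
    <Partitioning>
      <NetworkPartitions>
        <!-- hypothetical partition carrying the high-rate topics -->
        <NetworkPartition Name="images" Address="239.255.0.2"/>
      </NetworkPartitions>
      <PartitionMappings>
        <!-- map matching partition.topic pairs onto that network partition -->
        <PartitionMapping DCPSPartitionTopic="*.rt/image*" NetworkPartition="images"/>
      </PartitionMappings>
    </Partitioning>
  </Domain>
</CycloneDDS>
```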

@Flova

Flova commented Apr 2, 2024

I just tested it with normal multicast and spdp on our robot and the difference is quite significant. I think this is the issue.

Running ddsperf in the configuration described earlier results in 754.97 Mb/s with spdp, independent of the number of subscribers. Using normal multicast results in a drop to 25.17 Mb/s for all participants when a second subscriber is created.

All the traffic stays on the robot and a bridge interface is used as the network interface.

Thanks for your troubleshooting everybody and sorry for the late answer from us at @bit-bots.
