batman-adv: Multicast eventually not forwarded from/to clients on supernodes on v2023.x? #3059

maurerle opened this issue Nov 8, 2023 · 2 comments

maurerle commented Nov 8, 2023

Bug report

In FFAC I experienced IPv6 neighbor discovery breaking after some time, predominantly for wifi clients, on gluon v2023.x (not entirely sure about the affected versions - see below).

At first I suspected a wrong supernode configuration to be the cause.
But this issue also appeared on the local mesh IPv6 prefix (fdac::/64 in FFAC), within the mesh itself.
Pinging fe80:: addresses from the supernode to the client still worked fine, though.

The IP addresses of the Gluon nodes themselves have not been affected so far.
I could also reproduce this issue with a client behind a gluon v2022.1.x node connected to a supernode running batman-adv v2023.x (v2023.0 or v2023.2, I don't remember which).

It does not affect IPv4 traffic, only IPv6 neighbor discovery (neighbor advertisement/solicitation).
This results in broken IPv6 for clients, which leads to a flaky web connection.

A possibly related problem could be #2854.
However, I did not see the router's neighbor discovery packets being sent to the client on the gluon node's bat0, so this looks different (or should I have looked at local-node/any instead?).

Fixes - Disabling multicast optimizations

This issue has been discussed on IRC. @T-X asked me to disable multicast optimizations in batman-adv using batctl multicast_forceflood 1, which did help.

To get it working throughout the mesh domain, I had to set multicast_forceflood on all nodes in between to receive the multicast from my laptop (see the command sketch after this list):

  1. I could only ping the client's fdac:: address from its nextnode
  2. I set multicast_forceflood on its nextnode (not sure if that was required)
  3. I set multicast_forceflood on the mesh-vpn node connected via mesh-lan
  4. Now I could also ping the client's fdac:: address from the mesh-vpn node - but not from my laptop
  5. The supernode already had multicast_forceflood set
  6. After I set multicast_forceflood on my own nextnode too, I could ping the client's fdac:: address
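
A minimal sketch of the workaround commands (assuming a batctl that understands the meshif selector; older versions use -m <iface> instead, and on Gluon nodes batctl defaults to bat0):

$ batctl multicast_forceflood                   # print the current setting on a Gluon node
$ batctl multicast_forceflood 1                 # disable multicast optimizations, i.e. force classic flooding
$ batctl meshif bat20 multicast_forceflood 1    # same on a supernode with a non-default mesh interface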

Reproducing

To try to reproduce this, I would use a >=v2023.1 node, connect via wifi as a client and ping the IPv6 or fdac:: addresses of other nodes in the same mesh domain/segment. If this keeps working for more than 3 days, I don't think you have this issue.
It helps to use Freifunk as your main connection while working from home, to monitor this.
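
A minimal sketch of that test from a Linux client (the addresses below are placeholders, not real FFAC nodes):

$ ping6 -c 10 fdac::211:22ff:feaa:bbcc   # local mesh ULA of another node or client in the segment
$ ping6 -c 10 2001:db8::1                # or a global address (documentation prefix used as placeholder)

If these pings start failing after a few days while the target's fe80:: address still answers, it matches what I am seeing.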

Affected Versions/Devices

I could not yet pin this down to an affected (batman-adv) version, as the supernode might have been the cause when I reproduced this on gluon v2022.x nodes, and the gluon node with batman-adv v2023.x might have been the cause when the supernode ran v2022.
But I suspect it to be somehow related to batman-adv v2023.x 🤷

It has been reproduced with supernodes running Debian 11/12 and batman-adv (v2022.0)/v2023.0/v2023.2.
It has been seen on gluon (v2022.x), v2023.1, v2023.1.1 and master.
It has been seen on a Fire TV stick, a Debian 12 laptop, a Samsung phone and others.

Somehow no other community has reported similar issues yet.

What is the expected behaviour?
IPv6 neighbor discovery should work for clients.

Gluon Version:
v2023.x

Site Configuration:
https://github.com/ffac/site

Next Steps - Further investigation

This issue is used to publicly track the problem and document further progress.

Things to do when this issue can be seen:

  • Check that for each of the client's unicast IPv6 addresses the gluon node has a corresponding, mapped IPv6 solicited-node multicast address, and in turn an Ethernet multicast address mapped from that (see the sketch after this list for an example mapping)
    • This multicast Ethernet address should always be visible in batman-adv's translation table while the client is online: batctl tg -m on remote nodes and batctl tl -m on the node serving this client
    • As long as a client device does not use IPv6 privacy extensions, it should be recognizable as something like 33:33:ff:<last-3-bytes-of-unicast-MAC>
  • Check whether the bridge wakeup-call feature is disabled, as no special ICMPv6 echo requests or unicast MLD queries were seen in the pcap dump (then this would only affect clients on wifi? - I am not sure this is plausible, as I had to set multicast_forceflood on all nodes in between):
    • Check on Gluon v2023.1 that one sees the special ICMPv6 echo request + reply from/to the bridge to/from a wifi client device, with an echo request/reply identifier of 0xEC6B
    • And then a unicast MLD query from the bridge in response to the received echo request, and then again a response to that from the client device with a multicast MLD report
  • Narrow down the affected batman-adv/gluon versions
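
As a sketch of the first two checks, with a made-up client MAC of 00:11:22:aa:bb:cc that uses SLAAC/EUI-64 (i.e. no privacy extensions):

# address mapping for this hypothetical client:
#   unicast ULA:               fdac::211:22ff:feaa:bbcc
#   solicited-node multicast:  ff02::1:ffaa:bbcc   (ff02::1:ff + last 24 bits of the unicast address)
#   Ethernet multicast:        33:33:ff:aa:bb:cc   (33:33 + last 32 bits of the multicast address)

# on the node serving the client, this MAC should show up in the local translation table:
$ batctl tl -m | grep -i 33:33:ff:aa:bb:cc
# on remote nodes / the supernode, it should show up in the global translation table:
$ batctl tg -m | grep -i 33:33:ff:aa:bb:cc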
T-X commented Nov 15, 2023

Thanks for the report! I really appreciate these investigations. Especially as the next multicast mechanism is going to land in batman-adv soon, it is a good idea to double-check the current one.

tl;dr: Couldn't reproduce the issue yet, might need further guidance.

I've flashed the latest stable Freifunk Aachen firmware on a CPE210 (gluon-ffac-v2023.1.1-1-tp-link-cpe210-v1-sysupgrade.bin) and performed the following tests:

bridge multicast wakeup call feature/workaround:

Seems to work as expected, I see the additional packet exchange (a special ICMPv6 echo request/reply and a subsequent unicasted MLD query) between the CPE210 and an Android 9 phone.

Multicasted ICMPv6 Echo request:

I've tested sending 300 ICMPv6 echo requests via multicast from my Linux/Debian laptop to the IPv6 solicited-node multicast address of the (default) gateway I got assigned, over 25 minutes over WiFi:

$ ping6 -c 300 -i 5 ff02::1:ffac:2020%wlp1s0
...
--- ff02::1:ffac:2020%wlp1s0 ping statistics ---
300 packets transmitted, 300 received, 0% packet loss, time 1496283ms
rtt min/avg/max/mdev = 23.052/38.048/1225.936/73.173 ms

All received a valid response.

Wireshark I/O graph:

[Wireshark I/O graph: icmpv6-mc-echo-1500s]

pcapng capture file: icmpv6-mc-echo-1500s.pcapng (attached as "not-a-png")
pcap capture filter: (icmp6[icmp6type] = icmp6-echo and ip6 dst ff02::1:ffac:2020 and ether src 0e:b8:f7:78:c3:7c) or (icmp6[icmp6type] = icmp6-echoreply and ether src 88:e6:ff:ac:20:20)

Black line, at 2: Number of packets per 5 seconds
Green line, at 1: Number of captured echo requests per 5 seconds
Red squares, at 1: Number of captured echo replies per 5 seconds

So it seems to look fine, at least in this scenario?
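
For reference, an equivalent capture could be taken on the laptop with e.g. tcpdump, reusing the filter above (an assumption about how to reproduce the capture, not necessarily how this one was recorded):

$ tcpdump -i wlp1s0 -w icmpv6-mc-echo.pcap '(icmp6[icmp6type] = icmp6-echo and ip6 dst ff02::1:ffac:2020 and ether src 0e:b8:f7:78:c3:7c) or (icmp6[icmp6type] = icmp6-echoreply and ether src 88:e6:ff:ac:20:20)'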

ICMPv6 Neighbor Discovery / Solicitation:

I've then also tested sending 300 ICMPv6 Neighbor Solicitations via multicast from my Linux/Debian laptop to the IPv6 solicited-node multicast address of the (default) gateway I got assigned, over 25 minutes over WiFi. I've used the ipv6toolkit for that.

$ ./ns6 -i wlp1s0 -l -z 5 -t fe80::8ae6:ffff:feac:2020 -d ff02::1:ffac:2020 -s fe80::cb8:f7ff:fe78:c37c -S 0e:b8:f7:78:c3:7c

All except one neighbor solicitation received a response via an ICMPv6 Neighbor Advertisement:

[Wireshark I/O graph: icmpv6-nd-1500s]

pcapng capture file: icmpv6-nd-1500s.pcapng (attached as "not-a-png")
pcap capture filter: (icmp6[icmp6type] = icmp6-neighborsolicit and ip6 dst ff02::1:ffac:2020 and ether src 0e:b8:f7:78:c3:7c) or (icmp6[icmp6type] = icmp6-neighboradvert and ether src 88:e6:ff:ac:20:20)

Black line, at 2: Number of packets per 5 seconds
Green line, at 1: Number of captured neighbor solicitations per 5 seconds
Red squares, at 1: Number of captured neighbor advertisements per 5 seconds

So it seems to look fine, at least in this scenario?


Hence, I might need a little more guidance on how to reproduce the issue.

maurerle commented Dec 18, 2023

Last week, someone from our community reported this issue again, and I had broken IPv6 due to it as well.
The supernode no longer had multicast_forceflood set for this specific mesh cloud (a leftover from the last debugging session).
As I needed to go to a meeting, I only ran batctl bat20 multicast_forceflood 1 on the supernode, confirmed it was working and called it a day.

I tried to debug this with a separate supernode, but somehow could not reproduce it for a few days, as I probably need a mesh cloud larger than 2 nodes (which is hard to get without affecting other people...).
But this issue is still relevant and I keep it in mind; I currently just don't have time for it.

And I am happy that I now know a functional workaround.
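
For the next debugging round, one more thing that might be worth collecting (my assumption, not something discussed above; requires a batctl that provides the mcast_flags table) is the multicast flags announced by each originator, both while multicast still works and once it breaks:

$ batctl mcast_flags                   # on a Gluon node (bat0 is the default mesh interface)
$ batctl meshif bat20 mcast_flags      # on the supernode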
