Next Hop-based routing with fallback to flooding #2856

GUVWAF · 2023-10-02T17:45:02Z

This adds a NextHopRouter which only relays if it is the next hop for a packet. The next hop is set by the current relayer of a packet, which bases this on information from either the NeighborInfoModule, or a previous successful delivery via flooding. Right now, it is only used for DMs and not used for broadcasts. Using the NeighborInfoModule, it can derive the next hop of neighbors and that of neighbors of neighbors. For others, it has no information in the beginning, which results into falling back to the FloodingRouter. Upon successful delivery via flooding, it updates the next hop of the recipient to the node that last relayed the ACK to us. When the ReliableRouter is doing retransmissions because it didn't get an ACK, for the last retry, it will reset the next hop, in order to fall back to the FloodingRouter.

Some implementation details: the next_hop and current_relayer are added to the unencrypted PacketHeader. Since we only used 14 bytes out of the 16-byte header (it got this size due to memory alignment), I used the 2 unused bytes to save the last byte of the NodeNum of the next hop and current relayer. This means that there is a chance that the last byte of two nodes match, and then they will both relay. However, that’s not a big issue as that would be similar to flooding. The chance that only the intended next hop relays depends on the amount of nodes that can hear the packet. If that are 10 nodes (that would be a lot), the chance that only the next hop relays is 83.7% ((255/256*254/256*253/256 ... 247/256*)*100), for 5 nodes the chance is 96.1% ((255/256*254/256*253/256*252/256)*100).
This means that there is actually no additional overhead, except that per node, the next_hop has to be stored in the NodeDB (only 4 bytes).

With this, you get rid of the unnecessary rebroadcasts caused by flooding like the one from node 0 as shown below. (Note that an arrow to a node means that it received it (so there might be multiple arrows for one packet), but not necessarily that it was addressed.)

With the NextHopRouter, node 0 doesn't try to relay:

Furthermore, it solves this issue of the current implementation where the wrong node (1, because it has the lowest SNR) is relaying:

With the NextHopRouter, because 0 knew 2 is a neighbor of 3, 2 was set as next hop and the DM to 3 correctly arrives, as well as the real ACK.

I'm hoping to get some feedback, and maybe new ideas on assigning the next hop from this. Also, I would love to get ideas on how to improve broadcasts as well.

loodydo · 2023-10-02T19:24:44Z

I have started looking. May take me some to to finish. I am not a statistician so I asked ChatGPT about the probability that the last byte of two nodes match. It came up with the result of 16.31%. Considering that this is a non breaking change I would think that the risk of two nodes broadcasting is still worth it.

github-actions · 2023-10-03T13:15:39Z

🤖 Pull request artifacts

file	commit
`pr2856-firmware-2.2.10.44dc270.zip`	`44dc270`

Pull request: meshtastic/firmware#2856 Commit: meshtastic/firmware@3776064

GUVWAF · 2023-10-03T17:59:22Z

I am not a statistician so I asked ChatGPT about the probability that the last byte of two nodes match. It came up with the result of 16.31%. Considering that this is a non breaking change I would think that the risk of two nodes broadcasting is still worth it.

It depends on the amount of nodes that can hear the packet (so the amount of immediate neighbors). But indeed, the chance that then only the intended next hop relays is a bit different than I presented before. I've updated the description with two calculations.

src/mesh/RadioInterface.h

src/mesh/NextHopRouter.cpp

GUVWAF · 2023-10-13T19:02:45Z

I pushed a new commit to resolve @loodydo’s comments. Also added the hop limit setting of the original transmitter to the header flags (using 3 of the 4 currently unused bits), such that we can determine the amount of hops the packet has already traveled. This comes in handy to set the next hop for immediate neighbors, and I think it’s also useful for displaying in the apps. Currently it will always show the SNR/RSSI of nodes in the list, but this is actually the SNR/RSSI to the last relayer. In cases that a node is not an immediate neighbor, we could display the hops towards that node.

I also changed the way how the next_hop and relay_node are stored in the MeshPacket and NodeDB. Now only the last byte is stored, since only that byte is sent over the air. Besides, it mitigates the case where you might assign the wrong NodeNum if you happen to have multiple nodes with the same last byte in the NodeDB.

Pull request: meshtastic/firmware#2856 Commit: meshtastic/firmware@44dc270

…inal transmitter

GUVWAF · 2023-11-26T22:14:30Z

I think this PR is ready to merge, except that it’s a breaking change unfortunately.
I did a test and it seems that the two unused bytes are not set to zero with current firmware. Meaning with this PR it would think someone set the next hop for a specific node and no one will relay the message.
We’ll have to wait for 3.0 with this.

@S5NC

As per an idea of @S5NC

src/mesh/Router.cpp

src/mesh/ReliableRouter.cpp

roha-github · 2024-05-07T20:34:15Z

If you use 1 byte, then I suggest modulo 251 of the whole name represented by UINT_64. The probability is then 0.4%

GUVWAF · 2024-05-07T20:53:02Z

@roha-github What do you mean with "whole name represented by UINT_64" and which probability are you referring to?

roha-github · 2024-05-07T21:10:34Z

@GUVWAF

"I used the 2 unused bytes to save the last byte of the NodeNum of the next hop and current relayer" ...

don't "last byte of the NodeNum", use NodeNum modulo 251 (max 1 byte prim number) as filter

GUVWAF · 2024-05-07T21:19:47Z

The NodeNum is only 32 bits and based on the MAC address of the device.
Assuming that that number is uniformly distributed I don't get why using modulo of the largest prime number is a better approach.

Where does your 0.4% probability refer to?

roha-github · 2024-05-07T21:23:44Z

UINT32 modulo 251 = 0..250
1/251 = 0.4%
Using prim numbers for checksum is a common pattern

GUVWAF · 2024-05-07T21:31:47Z

I'm sorry but I don't get why that is better than using the last byte, which is 0..255.

It's not a checksum, it's to uniquely identify a node. If the NodeNum is already uniformly distributed, taking the last byte should not have any bias.

roha-github · 2024-05-07T21:50:38Z

About the routing itself ...

Each node could regularly publish a broadcast of its direct neighbors and communicate the reception quality of RSSI and SNR.

The list is sorted by reception quality and only the best 32 nodes are published.

1 byte or 8 bits are used per node:
Bit 1 = RSSI quality > -115 dB
Bit 2 = SNR quality > -11 dB
Bit 3..8 = NodeNum mod 64

Together with the node DB and the GPS coordinates, either the node itself or the app (smartphone, web browser, CLI) can then create a network topology.

Only 7 bytes would be required for a route with 7 hops

caveman99 · 2024-05-07T21:56:08Z

That's what the neighborinfo module does.

roha-github · 2024-05-07T22:14:51Z

That's what the neighborinfo module does.

Ok, now I understand the benefits of the Neighbor Info Module.

Should the node do the routing itself later? Does the node DB then store all the neighbors of each node? How much memory would this require? I once read somewhere that the node DB would only store up to 80 nodes.

The documentation on the Neighbor Info Module does not really encourage activation. The module would be needed for routing and ever larger mesh networks.

tve · 2024-05-19T05:49:15Z

I think this PR is ready to merge, except that it’s a breaking change unfortunately.
I did a test and it seems that the two unused bytes are not set to zero with current firmware. Meaning with this PR it would think someone set the next hop for a specific node and no one will relay the message.
We’ll have to wait for 3.0 with this.

I do not understand the "wait for 3.0", most likely I don't understand what deploying 3.0 means... I would assume that nodes in a mesh will be upgraded slowly, especially big meshes, like SoCalMesh with hundreds of nodes, some in pretty inaccessible locations. This means that whether it's called 3.0 or not the next hop routing will have to co-exist with 2.x. Unless you count on all the 2.X having been upgraded to a version that sets next_hop to zero by then...

I'm wondering whether it would be possible to use the NeighborInfo to gate next-hop routing. If the NeighborInfo can have a flag that says "I do next-hop routing" then on the outgoing side a relayer knows whether populating next-hop makes sense or not.

On the incoming side it's a bit trickier in that a relayer would look up the NeighborInfo based on relay_node. If relay_node matches a NeighborInfo that says it does next-hop routing then the relayer would obey next_hop, although of course the match may be by chance. As a special case, if hop_start is set and equal to hop_limit then IIUC the message is received from the original sender and thus NeighborInfo can be looked up using the from address.

In any case, it seems worth it to figure out some heuristics to make 3.0 interoperable with pre-2.3, which introduced the explicit next_hop field as far as I can tell and not rely on "it's going to take so long for 3.0 to become available that everyone will have upgraded past 2.3 by then"...

Apologies if I'm completely off-base and don't understand some key fact here.

tve · 2024-05-19T05:53:45Z

Shouldn't the NeighborInfo module be enabled by default?

GUVWAF · 2024-05-19T13:30:26Z

Version 3.0 will most likely have a different default modem preset (people have advocated for MediumFast) or another sync word to deliberately make nodes incompatible.

I'm not really sure half-working interoperable solutions are the way to go here, especially if it requires the NeighborInfo module to be enabled.
It's not that this PR has a huge impact anyway, as it only applies to DMs while I think the majority of packets are still broadcasts.

NeighborInfo creates quite a lot of additional traffic and is not necessary for NextHop routing, so I don't think it should be enabled by default.

KyleMaas · 2024-05-19T18:37:18Z

Version 3.0 will most likely have a different default modem preset (people have advocated for MediumFast) or another sync word to deliberately make nodes incompatible.

@GUVWAF I'd like to chime in here and say that I think this part has some major downsides. This threatens to split deployed infrastructure between old and new and drop coverage levels for both old and new until everything, including all client devices, have been changed over. For anyone who has gone further with deployment than just tinkering, this could be a major undertaking just due to the number of devices. And for anyone with infrastructure on mountaintops or other places which are difficult to access, these devices can be a pain in the neck to try to update. The more people using the same settings, the better it is for everyone's coverage, and the experience for new users would be better if they already have an existing network to connect to without having to build it all out themselves. Changing the default modem preset should not be done lightly.

And I agree with @tve's comments about version interoperability. A thoughtful and gradual upgrade path would make it far more likely for people to actually upgrade. Ideally you want people to upgrade to the new version so that you can continue to innovate moving forward. If the changes are too drastic, you risk having people stuck with "good enough, works for me" instead, or even potentially having someone maintain an old compatible version as an LTS release. Or looking at the project/protocol as so unstable that it's not worth continuing to deploy.

I love the idea of this, but a gradual rollout would make it a much easier pill to swallow.

GUVWAF · 2024-05-19T18:48:38Z

Ofcourse the decision will not be taken lightly and people have been saying it will be happing at least a year from now.

I'm not entirely following why you're advocating for gradual changes, as precisely that leads to "good enough, works for me", instead of doing all breaking changes at once, where you have to upgrade to continue to talk to others that have.

garthvh · 2024-05-19T19:15:06Z

I love the idea of this, but a gradual rollout would make it a much easier pill to swallow.

This pull is not merged because it is a breaking change that needs to wait until 3.0, and there is not enough other stuff to make a breaking change worth it. There is no way to do this gradually. Meshtastic is just not set up for multiple versions, when version 3 gets promoted version 2 is going to be deprecated.

tve · 2024-05-19T19:34:22Z

There is no way to do this gradually.

I'm not convinced, that's why I made suggestions above. Yes, there's no way to make the changes without compromises, but I don't know that the compromises have been explored or discussed. Quite possibly just my ignorance.

Meshtastic is just not set up for multiple versions, when version 3 gets promoted version 2 is going to be deprecated.

This is one of the biggest issues I have with the proposal: it uses up all the unused bits for its purposes and doesn't fix the versioning issue which causes the incompatibility nightmare. This kind'a guarantees that no more non-breaking changes can be made going forward which I'd rate as "not good".

At a high level I'm wondering what the plan is for wide-area meshes, which is where the routing improvements have the most impact I believe. I'm part of the southern california mesh with hundreds of nodes spread over hundreds of square km, many mountain ranges dividing the geography, and many nodes in difficult to access/update places. My overall assessment is that the mesh as a whole is pretty unusable other than to toy around and experiment (which is cool). If the goal is to make this type of mesh usable over longer distances (like sending messages from Santa Barbara to San Diego -- approx 350km) some serious thought needs to go into routing. Maybe the answer is that this should be done via MQTT (internet), maybe the answer is that there should be an overlay backbone network to bridge long distances, maybe the answer lies in just improving the current type of routing. I don't know and it also depends a lot on what the project goals are.

I'm not sure what the best place for this type of discussion is, but I believe it's relevant here because if there's going to be a breaking change then at least make the most out of it. I'm also not trying to piss on this PR: I actually think it's great, I'm just wondering about the bigger picture.

GUVWAF · 2024-05-19T23:42:50Z

This is one of the biggest issues I have with the proposal: it uses up all the unused bits for its purposes and doesn't fix the versioning issue which causes the incompatibility nightmare.

But this is really beyond the scope of this PR. Currently these two bytes are useless anyway, because before 2.3 they could take any value depending on what was in that memory location. This is just one of the breaking changes currently proposed for 3.0 (fixing text compression is another one), I believe a change to the header to make it future-proof deserves its own PR. Edit: I now see @KyleMaas already opened a discussion for this: #3932

My overall assessment is that the mesh as a whole is pretty unusable other than to toy around and experiment (which is cool).

What kind of issues are you having and what makes you believe this has to do with routing? For broadcasts the current managed flooding approach is really not too bad. I'm not convinced there even exists a protocol that's guaranteed to be better in all the different use cases for Meshtastic (especially fixed or mobile nodes, or a mix), that doesn't require an internet connection and can run on microprocessors with limited RAM. For "smarter" routing you need periodic control messages, but those eat up the same bandwidth as regular traffic.

Other problems with wireless mesh networks, e.g. the hidden node problem or unstable/changing RF conditions are not solved with better routing.

KyleMaas · 2024-05-20T01:28:38Z

@GUVWAF Yeah, starting that discussion to ask about forward compatibility is what led me into joining the discussion on this issue. I have several places in mind to add nodes that would vastly increase coverage in my area, but they are inaccessible enough that once they're up there, I'm probably realistically only going to be able to update the firmware maybe every 12 months or so. So I need to wait for a version of the firmware that has enough interoperability, or have all of the proposed protocol changes interoperable enough with older versions, that I can actually make these nodes available to the community. All of the current nodes I run are in areas which are at least reasonably accessible. But there's no sense in me taking on the time, effort, and expense of building out the expansions if it's going to be completely obsolete in a month - it makes more sense to wait for stability and not recommend Meshtastic to anyone who isn't already in our local coverage zones. So protocol stability and interoperability is extremely important to me.

garthvh · 2024-05-20T02:00:04Z

3.0 is like 18 months away best case. Honestly not really enough there yet to even consider it.

roha-github · 2024-05-20T06:53:26Z

@GUVWAF "Version 3.0 will most likely have a different default modem preset (people have advocated for MediumFast)"

Where are such far-reaching decisions discussed and decided? What exactly is the background? The minimal increase in bandwidth is "bought" with a loss of range and all 2.x routers & repeaters.

Alternatively, 3.x clients switch to Long / Fast or both networks run in parallel and apart.

GUVWAF · 2024-05-20T12:57:42Z

Where are such far-reaching decisions discussed and decided?

This was discussed by admins on Discord, but is also just an idea right now. When 3.0 comes closer, there will likely be an RFC for this.
Let's now continue this discussion in #3932.

Initial version of NextHopRouter

0d6729b

Merge branch 'master' into NextHopRouter

3776064

thebentern added a commit to meshtastic/artifacts that referenced this pull request Oct 3, 2023

Upload pr2856-firmware-2.2.10.3776064.zip (37760)

a82b183

Pull request: meshtastic/firmware#2856 Commit: meshtastic/firmware@3776064

loodydo reviewed Oct 4, 2023

View reviewed changes

src/mesh/RadioInterface.h Outdated Show resolved Hide resolved

loodydo reviewed Oct 9, 2023

View reviewed changes

src/mesh/NextHopRouter.cpp Outdated Show resolved Hide resolved

loodydo reviewed Oct 9, 2023

View reviewed changes

src/mesh/NextHopRouter.cpp Outdated Show resolved Hide resolved

loodydo reviewed Oct 9, 2023

View reviewed changes

src/mesh/NextHopRouter.cpp Outdated Show resolved Hide resolved

loodydo reviewed Oct 10, 2023

View reviewed changes

src/mesh/NextHopRouter.cpp Outdated Show resolved Hide resolved

Set original hop limit in header flags

44dc270

thebentern added a commit to meshtastic/artifacts that referenced this pull request Oct 13, 2023

Upload pr2856-firmware-2.2.10.44dc270.zip (44dc2)

8abab90

Pull request: meshtastic/firmware#2856 Commit: meshtastic/firmware@44dc270

GUVWAF and others added 5 commits October 14, 2023 17:17

Short-circuit to FloodingRouter for broadcasts

9b1dd75

If packet traveled 1 hop, set relay_node as next_hop for the orig…

27a492a

…inal transmitter

Merge branch 'master' into NextHopRouter

c7293cf

Merge branch 'master' into NextHopRouter

42757d8

Merge branch 'master' into NextHopRouter

81f57b6

GUVWAF added 2 commits November 27, 2023 20:36

Set last byte to 0xFF if it ended at 0x00

ef2c6ee

As per an idea of @S5NC

Also update next-hop based on received DM for us

3ba9ecb

S5NC reviewed Nov 28, 2023

View reviewed changes

src/mesh/Router.cpp Show resolved Hide resolved

GUVWAF mentioned this pull request Feb 17, 2024

[Feature Request]: Range check and mesh complexity: Proposed Backwards compatible Meshtastic protocol improvement #3239

Closed

thebentern mentioned this pull request Mar 3, 2024

[Feature Request]: Reduce wasted bandwidth by ensuring that packets propagate in the right direction #3326

Closed

thebentern added the 3.0 label Mar 16, 2024

GUVWAF mentioned this pull request Mar 23, 2024

Set unused header bytes to zero for future use #3479

Merged

GUVWAF and others added 2 commits April 20, 2024 13:35

Resolve conflicts (needs testing)

25ec051

Merge branch 'master' into NextHopRouter

b456e34

mcumings reviewed Apr 25, 2024

View reviewed changes

src/mesh/ReliableRouter.cpp Show resolved Hide resolved

Merge branch 'master' into NextHopRouter

d4ef0cd

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Next Hop-based routing with fallback to flooding #2856

Next Hop-based routing with fallback to flooding #2856

GUVWAF commented Oct 2, 2023 •

edited

loodydo commented Oct 2, 2023

github-actions bot commented Oct 3, 2023 •

edited

GUVWAF commented Oct 3, 2023

GUVWAF commented Oct 13, 2023

GUVWAF commented Nov 26, 2023

roha-github commented May 7, 2024

GUVWAF commented May 7, 2024

roha-github commented May 7, 2024

GUVWAF commented May 7, 2024

roha-github commented May 7, 2024

GUVWAF commented May 7, 2024

roha-github commented May 7, 2024 •

edited

caveman99 commented May 7, 2024

roha-github commented May 7, 2024

tve commented May 19, 2024

tve commented May 19, 2024

GUVWAF commented May 19, 2024 •

edited

KyleMaas commented May 19, 2024

GUVWAF commented May 19, 2024

garthvh commented May 19, 2024

tve commented May 19, 2024

GUVWAF commented May 19, 2024 •

edited

KyleMaas commented May 20, 2024

garthvh commented May 20, 2024

roha-github commented May 20, 2024

GUVWAF commented May 20, 2024

Next Hop-based routing with fallback to flooding #2856

Are you sure you want to change the base?

Next Hop-based routing with fallback to flooding #2856

Conversation

GUVWAF commented Oct 2, 2023 • edited

loodydo commented Oct 2, 2023

github-actions bot commented Oct 3, 2023 • edited

🤖 Pull request artifacts

GUVWAF commented Oct 3, 2023

GUVWAF commented Oct 13, 2023

GUVWAF commented Nov 26, 2023

roha-github commented May 7, 2024

GUVWAF commented May 7, 2024

roha-github commented May 7, 2024

GUVWAF commented May 7, 2024

roha-github commented May 7, 2024

GUVWAF commented May 7, 2024

roha-github commented May 7, 2024 • edited

caveman99 commented May 7, 2024

roha-github commented May 7, 2024

tve commented May 19, 2024

tve commented May 19, 2024

GUVWAF commented May 19, 2024 • edited

KyleMaas commented May 19, 2024

GUVWAF commented May 19, 2024

garthvh commented May 19, 2024

tve commented May 19, 2024

GUVWAF commented May 19, 2024 • edited

KyleMaas commented May 20, 2024

garthvh commented May 20, 2024

roha-github commented May 20, 2024

GUVWAF commented May 20, 2024

GUVWAF commented Oct 2, 2023 •

edited

github-actions bot commented Oct 3, 2023 •

edited

roha-github commented May 7, 2024 •

edited

GUVWAF commented May 19, 2024 •

edited

GUVWAF commented May 19, 2024 •

edited