Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Next Hop-based routing with fallback to flooding #2856

Draft
wants to merge 13 commits into
base: master
Choose a base branch
from
Draft

Conversation

GUVWAF
Copy link
Member

@GUVWAF GUVWAF commented Oct 2, 2023

This adds a NextHopRouter which only relays if it is the next hop for a packet. The next hop is set by the current relayer of a packet, which bases this on information from either the NeighborInfoModule, or a previous successful delivery via flooding. Right now, it is only used for DMs and not used for broadcasts. Using the NeighborInfoModule, it can derive the next hop of neighbors and that of neighbors of neighbors. For others, it has no information in the beginning, which results into falling back to the FloodingRouter. Upon successful delivery via flooding, it updates the next hop of the recipient to the node that last relayed the ACK to us. When the ReliableRouter is doing retransmissions because it didn't get an ACK, for the last retry, it will reset the next hop, in order to fall back to the FloodingRouter.

Some implementation details: the next_hop and current_relayer are added to the unencrypted PacketHeader. Since we only used 14 bytes out of the 16-byte header (it got this size due to memory alignment), I used the 2 unused bytes to save the last byte of the NodeNum of the next hop and current relayer. This means that there is a chance that the last byte of two nodes match, and then they will both relay. However, that’s not a big issue as that would be similar to flooding. The chance that only the intended next hop relays depends on the amount of nodes that can hear the packet. If that are 10 nodes (that would be a lot), the chance that only the next hop relays is 83.7% ((255/256*254/256*253/256 ... 247/256*)*100), for 5 nodes the chance is 96.1% ((255/256*254/256*253/256*252/256)*100).
This means that there is actually no additional overhead, except that per node, the next_hop has to be stored in the NodeDB (only 4 bytes).

With this, you get rid of the unnecessary rebroadcasts caused by flooding like the one from node 0 as shown below. (Note that an arrow to a node means that it received it (so there might be multiple arrows for one packet), but not necessarily that it was addressed.)
image
With the NextHopRouter, node 0 doesn't try to relay:
image

Furthermore, it solves this issue of the current implementation where the wrong node (1, because it has the lowest SNR) is relaying:
image
With the NextHopRouter, because 0 knew 2 is a neighbor of 3, 2 was set as next hop and the DM to 3 correctly arrives, as well as the real ACK.
image

I'm hoping to get some feedback, and maybe new ideas on assigning the next hop from this. Also, I would love to get ideas on how to improve broadcasts as well.

@loodydo
Copy link
Contributor

loodydo commented Oct 2, 2023

I have started looking. May take me some to to finish. I am not a statistician so I asked ChatGPT about the probability that the last byte of two nodes match. It came up with the result of 16.31%. Considering that this is a non breaking change I would think that the risk of two nodes broadcasting is still worth it.

@github-actions
Copy link
Contributor

github-actions bot commented Oct 3, 2023

🤖 Pull request artifacts

file commit
pr2856-firmware-2.2.10.44dc270.zip 44dc270

thebentern added a commit to meshtastic/artifacts that referenced this pull request Oct 3, 2023
@GUVWAF
Copy link
Member Author

GUVWAF commented Oct 3, 2023

I am not a statistician so I asked ChatGPT about the probability that the last byte of two nodes match. It came up with the result of 16.31%. Considering that this is a non breaking change I would think that the risk of two nodes broadcasting is still worth it.

It depends on the amount of nodes that can hear the packet (so the amount of immediate neighbors). But indeed, the chance that then only the intended next hop relays is a bit different than I presented before. I've updated the description with two calculations.

src/mesh/RadioInterface.h Outdated Show resolved Hide resolved
src/mesh/NextHopRouter.cpp Outdated Show resolved Hide resolved
src/mesh/NextHopRouter.cpp Outdated Show resolved Hide resolved
src/mesh/NextHopRouter.cpp Outdated Show resolved Hide resolved
src/mesh/NextHopRouter.cpp Outdated Show resolved Hide resolved
@GUVWAF
Copy link
Member Author

GUVWAF commented Oct 13, 2023

I pushed a new commit to resolve @loodydo’s comments. Also added the hop limit setting of the original transmitter to the header flags (using 3 of the 4 currently unused bits), such that we can determine the amount of hops the packet has already traveled. This comes in handy to set the next hop for immediate neighbors, and I think it’s also useful for displaying in the apps. Currently it will always show the SNR/RSSI of nodes in the list, but this is actually the SNR/RSSI to the last relayer. In cases that a node is not an immediate neighbor, we could display the hops towards that node.

I also changed the way how the next_hop and relay_node are stored in the MeshPacket and NodeDB. Now only the last byte is stored, since only that byte is sent over the air. Besides, it mitigates the case where you might assign the wrong NodeNum if you happen to have multiple nodes with the same last byte in the NodeDB.

thebentern added a commit to meshtastic/artifacts that referenced this pull request Oct 13, 2023
@GUVWAF
Copy link
Member Author

GUVWAF commented Nov 26, 2023

I think this PR is ready to merge, except that it’s a breaking change unfortunately.
I did a test and it seems that the two unused bytes are not set to zero with current firmware. Meaning with this PR it would think someone set the next hop for a specific node and no one will relay the message.
We’ll have to wait for 3.0 with this.

@roha-github
Copy link

If you use 1 byte, then I suggest modulo 251 of the whole name represented by UINT_64. The probability is then 0.4%

@GUVWAF
Copy link
Member Author

GUVWAF commented May 7, 2024

@roha-github What do you mean with "whole name represented by UINT_64" and which probability are you referring to?

@roha-github
Copy link

@GUVWAF

"I used the 2 unused bytes to save the last byte of the NodeNum of the next hop and current relayer" ...

don't "last byte of the NodeNum", use NodeNum modulo 251 (max 1 byte prim number) as filter

@GUVWAF
Copy link
Member Author

GUVWAF commented May 7, 2024

The NodeNum is only 32 bits and based on the MAC address of the device.
Assuming that that number is uniformly distributed I don't get why using modulo of the largest prime number is a better approach.

Where does your 0.4% probability refer to?

@roha-github
Copy link

UINT32 modulo 251 = 0..250
1/251 = 0.4%
Using prim numbers for checksum is a common pattern

@GUVWAF
Copy link
Member Author

GUVWAF commented May 7, 2024

I'm sorry but I don't get why that is better than using the last byte, which is 0..255.

It's not a checksum, it's to uniquely identify a node. If the NodeNum is already uniformly distributed, taking the last byte should not have any bias.

@roha-github
Copy link

roha-github commented May 7, 2024

About the routing itself ...

Each node could regularly publish a broadcast of its direct neighbors and communicate the reception quality of RSSI and SNR.

The list is sorted by reception quality and only the best 32 nodes are published.

1 byte or 8 bits are used per node:
Bit 1 = RSSI quality > -115 dB
Bit 2 = SNR quality > -11 dB
Bit 3..8 = NodeNum mod 64

Together with the node DB and the GPS coordinates, either the node itself or the app (smartphone, web browser, CLI) can then create a network topology.

Only 7 bytes would be required for a route with 7 hops

@caveman99
Copy link
Sponsor Member

That's what the neighborinfo module does.

@roha-github
Copy link

That's what the neighborinfo module does.

Ok, now I understand the benefits of the Neighbor Info Module.

Should the node do the routing itself later? Does the node DB then store all the neighbors of each node? How much memory would this require? I once read somewhere that the node DB would only store up to 80 nodes.

The documentation on the Neighbor Info Module does not really encourage activation. The module would be needed for routing and ever larger mesh networks.

@tve
Copy link

tve commented May 19, 2024

I think this PR is ready to merge, except that it’s a breaking change unfortunately.
I did a test and it seems that the two unused bytes are not set to zero with current firmware. Meaning with this PR it would think someone set the next hop for a specific node and no one will relay the message.
We’ll have to wait for 3.0 with this.

I do not understand the "wait for 3.0", most likely I don't understand what deploying 3.0 means... I would assume that nodes in a mesh will be upgraded slowly, especially big meshes, like SoCalMesh with hundreds of nodes, some in pretty inaccessible locations. This means that whether it's called 3.0 or not the next hop routing will have to co-exist with 2.x. Unless you count on all the 2.X having been upgraded to a version that sets next_hop to zero by then...

I'm wondering whether it would be possible to use the NeighborInfo to gate next-hop routing. If the NeighborInfo can have a flag that says "I do next-hop routing" then on the outgoing side a relayer knows whether populating next-hop makes sense or not.

On the incoming side it's a bit trickier in that a relayer would look up the NeighborInfo based on relay_node. If relay_node matches a NeighborInfo that says it does next-hop routing then the relayer would obey next_hop, although of course the match may be by chance. As a special case, if hop_start is set and equal to hop_limit then IIUC the message is received from the original sender and thus NeighborInfo can be looked up using the from address.

In any case, it seems worth it to figure out some heuristics to make 3.0 interoperable with pre-2.3, which introduced the explicit next_hop field as far as I can tell and not rely on "it's going to take so long for 3.0 to become available that everyone will have upgraded past 2.3 by then"...

Apologies if I'm completely off-base and don't understand some key fact here.

@tve
Copy link

tve commented May 19, 2024

Shouldn't the NeighborInfo module be enabled by default?

@GUVWAF
Copy link
Member Author

GUVWAF commented May 19, 2024

Version 3.0 will most likely have a different default modem preset (people have advocated for MediumFast) or another sync word to deliberately make nodes incompatible.

I'm not really sure half-working interoperable solutions are the way to go here, especially if it requires the NeighborInfo module to be enabled.
It's not that this PR has a huge impact anyway, as it only applies to DMs while I think the majority of packets are still broadcasts.

NeighborInfo creates quite a lot of additional traffic and is not necessary for NextHop routing, so I don't think it should be enabled by default.

@KyleMaas
Copy link

Version 3.0 will most likely have a different default modem preset (people have advocated for MediumFast) or another sync word to deliberately make nodes incompatible.

@GUVWAF I'd like to chime in here and say that I think this part has some major downsides. This threatens to split deployed infrastructure between old and new and drop coverage levels for both old and new until everything, including all client devices, have been changed over. For anyone who has gone further with deployment than just tinkering, this could be a major undertaking just due to the number of devices. And for anyone with infrastructure on mountaintops or other places which are difficult to access, these devices can be a pain in the neck to try to update. The more people using the same settings, the better it is for everyone's coverage, and the experience for new users would be better if they already have an existing network to connect to without having to build it all out themselves. Changing the default modem preset should not be done lightly.

And I agree with @tve's comments about version interoperability. A thoughtful and gradual upgrade path would make it far more likely for people to actually upgrade. Ideally you want people to upgrade to the new version so that you can continue to innovate moving forward. If the changes are too drastic, you risk having people stuck with "good enough, works for me" instead, or even potentially having someone maintain an old compatible version as an LTS release. Or looking at the project/protocol as so unstable that it's not worth continuing to deploy.

I love the idea of this, but a gradual rollout would make it a much easier pill to swallow.

@GUVWAF
Copy link
Member Author

GUVWAF commented May 19, 2024

Ofcourse the decision will not be taken lightly and people have been saying it will be happing at least a year from now.

I'm not entirely following why you're advocating for gradual changes, as precisely that leads to "good enough, works for me", instead of doing all breaking changes at once, where you have to upgrade to continue to talk to others that have.

@garthvh
Copy link
Member

garthvh commented May 19, 2024

I love the idea of this, but a gradual rollout would make it a much easier pill to swallow.

This pull is not merged because it is a breaking change that needs to wait until 3.0, and there is not enough other stuff to make a breaking change worth it. There is no way to do this gradually. Meshtastic is just not set up for multiple versions, when version 3 gets promoted version 2 is going to be deprecated.

@tve
Copy link

tve commented May 19, 2024

There is no way to do this gradually.

I'm not convinced, that's why I made suggestions above. Yes, there's no way to make the changes without compromises, but I don't know that the compromises have been explored or discussed. Quite possibly just my ignorance.

Meshtastic is just not set up for multiple versions, when version 3 gets promoted version 2 is going to be deprecated.

This is one of the biggest issues I have with the proposal: it uses up all the unused bits for its purposes and doesn't fix the versioning issue which causes the incompatibility nightmare. This kind'a guarantees that no more non-breaking changes can be made going forward which I'd rate as "not good".

At a high level I'm wondering what the plan is for wide-area meshes, which is where the routing improvements have the most impact I believe. I'm part of the southern california mesh with hundreds of nodes spread over hundreds of square km, many mountain ranges dividing the geography, and many nodes in difficult to access/update places. My overall assessment is that the mesh as a whole is pretty unusable other than to toy around and experiment (which is cool). If the goal is to make this type of mesh usable over longer distances (like sending messages from Santa Barbara to San Diego -- approx 350km) some serious thought needs to go into routing. Maybe the answer is that this should be done via MQTT (internet), maybe the answer is that there should be an overlay backbone network to bridge long distances, maybe the answer lies in just improving the current type of routing. I don't know and it also depends a lot on what the project goals are.

I'm not sure what the best place for this type of discussion is, but I believe it's relevant here because if there's going to be a breaking change then at least make the most out of it. I'm also not trying to piss on this PR: I actually think it's great, I'm just wondering about the bigger picture.

@GUVWAF
Copy link
Member Author

GUVWAF commented May 19, 2024

This is one of the biggest issues I have with the proposal: it uses up all the unused bits for its purposes and doesn't fix the versioning issue which causes the incompatibility nightmare.

But this is really beyond the scope of this PR. Currently these two bytes are useless anyway, because before 2.3 they could take any value depending on what was in that memory location. This is just one of the breaking changes currently proposed for 3.0 (fixing text compression is another one), I believe a change to the header to make it future-proof deserves its own PR. Edit: I now see @KyleMaas already opened a discussion for this: #3932

My overall assessment is that the mesh as a whole is pretty unusable other than to toy around and experiment (which is cool).

What kind of issues are you having and what makes you believe this has to do with routing? For broadcasts the current managed flooding approach is really not too bad. I'm not convinced there even exists a protocol that's guaranteed to be better in all the different use cases for Meshtastic (especially fixed or mobile nodes, or a mix), that doesn't require an internet connection and can run on microprocessors with limited RAM. For "smarter" routing you need periodic control messages, but those eat up the same bandwidth as regular traffic.

Other problems with wireless mesh networks, e.g. the hidden node problem or unstable/changing RF conditions are not solved with better routing.

@KyleMaas
Copy link

@GUVWAF Yeah, starting that discussion to ask about forward compatibility is what led me into joining the discussion on this issue. I have several places in mind to add nodes that would vastly increase coverage in my area, but they are inaccessible enough that once they're up there, I'm probably realistically only going to be able to update the firmware maybe every 12 months or so. So I need to wait for a version of the firmware that has enough interoperability, or have all of the proposed protocol changes interoperable enough with older versions, that I can actually make these nodes available to the community. All of the current nodes I run are in areas which are at least reasonably accessible. But there's no sense in me taking on the time, effort, and expense of building out the expansions if it's going to be completely obsolete in a month - it makes more sense to wait for stability and not recommend Meshtastic to anyone who isn't already in our local coverage zones. So protocol stability and interoperability is extremely important to me.

@garthvh
Copy link
Member

garthvh commented May 20, 2024

3.0 is like 18 months away best case. Honestly not really enough there yet to even consider it.

@roha-github
Copy link

@GUVWAF "Version 3.0 will most likely have a different default modem preset (people have advocated for MediumFast)"

Where are such far-reaching decisions discussed and decided? What exactly is the background? The minimal increase in bandwidth is "bought" with a loss of range and all 2.x routers & repeaters.

Alternatively, 3.x clients switch to Long / Fast or both networks run in parallel and apart.

@GUVWAF
Copy link
Member Author

GUVWAF commented May 20, 2024

Where are such far-reaching decisions discussed and decided?

This was discussed by admins on Discord, but is also just an idea right now. When 3.0 comes closer, there will likely be an RFC for this.
Let's now continue this discussion in #3932.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

10 participants