fix(kad): correctly handle `NoKnownPeers` error when bootstrap #5349

stormshield-frb · 2024-04-30T13:32:15Z

Description

After testing master, we encountered a bug due to #4838 when doing automatic or periodic bootstrap if the node has no known peers.

Since it failed immediately, I thought there was no need to call the bootstrap_status.on_started method. But not doing so meant the periodic timer inside bootstrap_status was never reset, so the node got stuck trying to bootstrap every time poll was called on kad::Behaviour.
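A minimal sketch of the failure mode described above. It does not use the real libp2p types: `Status`, its integer-tick timer, the `reset_on_failure` flag and `attempts_with_no_peers` are all hypothetical stand-ins chosen to make the loop deterministic.

```rust
/// Hypothetical, simplified stand-in for the bootstrap status tracker.
/// It models only the periodic trigger, with integer ticks instead of a real timer.
struct Status {
    period: u64,
    next_due: u64,
}

impl Status {
    fn new(period: u64) -> Self {
        Status { period, next_due: period }
    }

    /// Returns true when a bootstrap should be triggered at time `now`.
    fn poll_next_bootstrap(&self, now: u64) -> bool {
        now >= self.next_due
    }

    /// Re-arms the periodic trigger.
    fn on_started(&mut self, now: u64) {
        self.next_due = now + self.period;
    }
}

#[derive(Debug, PartialEq)]
struct NoKnownPeers;

/// Models a bootstrap that fails immediately on an empty routing table.
/// `reset_on_failure` (an assumption of this sketch) switches between the
/// behaviour before the fix (false) and after it (true).
fn bootstrap(
    status: &mut Status,
    now: u64,
    known_peers: usize,
    reset_on_failure: bool,
) -> Result<(), NoKnownPeers> {
    if known_peers == 0 {
        if reset_on_failure {
            status.on_started(now);
        }
        return Err(NoKnownPeers);
    }
    status.on_started(now);
    Ok(())
}

/// Counts bootstrap attempts over `polls` consecutive polls with no known peers.
fn attempts_with_no_peers(polls: u64, reset_on_failure: bool) -> u64 {
    let mut status = Status::new(5);
    let mut attempts = 0;
    for now in 0..polls {
        if status.poll_next_bootstrap(now) {
            attempts += 1;
            let _ = bootstrap(&mut status, now, 0, reset_on_failure);
        }
    }
    attempts
}
```

Without the reset, the trigger stays due after the first expiry, so every later poll attempts a bootstrap. With the reset, attempts happen once per period.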

Notes & open questions

N/A

Change checklist

I have performed a self-review of my own code
I have made corresponding changes to the documentation
I have added tests that prove my fix is effective or that my feature works
A changelog entry has been made in the appropriate crates

protocols/kad/CHANGELOG.md

guillaumemichel · 2024-04-30T14:33:44Z

This looks a bit hacky, wouldn't it be better to modify the bootstrap Status instead (e.g. poll_next_bootstrap)?

stormshield-frb · 2024-05-02T07:39:21Z

> This looks a bit hacky, wouldn't it be better to modify the bootstrap Status instead (e.g. poll_next_bootstrap)?

I'm not sure I understand what you mean. on_started and on_finished are intended for that purpose.

Even if I updated the Status directly, we could not remove on_started completely, since the end user can still trigger a bootstrap manually. We could not remove on_finished at all, since there is currently no way to detect that a bootstrap has finished other than in query_finished or query_timeout. And since a bootstrap can also fail immediately, we have to handle that case there.

I agree that it would feel better to have a passive way to learn that a bootstrap did start or finish but I don't see how to implement that in a reasonably simple manner.

The reason we need to know whether a bootstrap has started or finished is that we don't want to cascade bootstrap requests. When a bootstrap is triggered (no matter whether it was automatic, periodic or manual), we reset the automatic and periodic timers to their initial values.
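The cascade-avoidance idea can be sketched roughly as follows. This is a toy model under stated assumptions, not the actual Status implementation: `BootstrapTimers`, `on_new_peer`, `should_bootstrap`, the integer ticks and the in-flight counter are hypothetical names introduced for illustration.

```rust
/// Hypothetical model of the automatic and periodic bootstrap triggers,
/// using integer ticks instead of real timers.
struct BootstrapTimers {
    period: u64,
    automatic_delay: u64,
    periodic_due: u64,
    automatic_due: Option<u64>,
    in_flight: u32,
}

impl BootstrapTimers {
    fn new(period: u64, automatic_delay: u64) -> Self {
        BootstrapTimers {
            period,
            automatic_delay,
            periodic_due: period,
            automatic_due: None,
            in_flight: 0,
        }
    }

    /// A routing table change arms the automatic trigger.
    fn on_new_peer(&mut self, now: u64) {
        self.automatic_due = Some(now + self.automatic_delay);
    }

    /// Any bootstrap start (automatic, periodic or manual) clears the
    /// automatic trigger and re-arms the periodic one.
    fn on_started(&mut self, now: u64) {
        self.in_flight += 1;
        self.automatic_due = None;
        self.periodic_due = now + self.period;
    }

    /// Records that one bootstrap has finished.
    fn on_finished(&mut self) {
        self.in_flight = self.in_flight.saturating_sub(1);
    }

    /// No new bootstrap is triggered while one is still in flight.
    fn should_bootstrap(&self, now: u64) -> bool {
        if self.in_flight > 0 {
            return false;
        }
        let automatic = self.automatic_due.map_or(false, |due| now >= due);
        automatic || now >= self.periodic_due
    }
}
```

In this model a manual bootstrap at tick 2 pushes the periodic trigger to tick 12 and discards the pending automatic one, so a single start suppresses both follow-up triggers.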

guillaumemichel · 2024-05-02T11:26:44Z

protocols/kad/src/behaviour.rs

@@ -931,6 +931,7 @@ where
    /// This parameter is used to call [`Behaviour::bootstrap`] periodically and automatically
    /// to ensure a healthy routing table.
    pub fn bootstrap(&mut self) -> Result<QueryId, NoKnownPeers> {
+        self.bootstrap_status.on_started();


What I meant is that if peers.is_empty() we may not want to call self.bootstrap_status.on_started() at all, because we aren't performing a bootstrap.

If a periodic timer needs to be reset within the Status then the timer reset should probably be implemented there rather than calling on_started(), because no actual bootstrap is started. Or maybe I didn't understand what the original issue is?
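One way to read this suggestion, sketched under assumptions: the status type exposes a dedicated timer reset that does not record a started bootstrap. `reset_periodic_timer`, the `in_flight` counter and the integer ticks are hypothetical and not part of the existing code.

```rust
/// Hypothetical status tracker with a separate timer reset.
struct Status {
    period: u64,
    periodic_due: u64,
    in_flight: u32,
}

impl Status {
    fn new(period: u64) -> Self {
        Status { period, periodic_due: period, in_flight: 0 }
    }

    /// Re-arms the periodic trigger without claiming that a bootstrap started.
    fn reset_periodic_timer(&mut self, now: u64) {
        self.periodic_due = now + self.period;
    }

    /// Records a real bootstrap start and re-arms the periodic trigger.
    fn on_started(&mut self, now: u64) {
        self.in_flight += 1;
        self.reset_periodic_timer(now);
    }
}

#[derive(Debug, PartialEq)]
struct NoKnownPeers;

/// On an empty routing table only the timer is reset; no start is recorded.
fn bootstrap(status: &mut Status, now: u64, known_peers: usize) -> Result<(), NoKnownPeers> {
    if known_peers == 0 {
        status.reset_periodic_timer(now);
        return Err(NoKnownPeers);
    }
    status.on_started(now);
    Ok(())
}
```

The difference from calling on_started unconditionally is that the in-flight count stays at zero after an immediate failure, so nothing later has to balance it with a matching finish.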

stormshield-frb · 2024-06-05

Sorry for the long time to reply.

I partly agree with what you say. However, we have since run into another issue related to this.

Before periodic / automatic bootstrap existed, the user could always react to a NoKnownPeers error because they were the one who triggered the bootstrap. Now that the bootstrap is triggered automatically, the user is never notified when a NoKnownPeers error occurs.

That is why, in our fork, we have removed the special case that checks whether the routing table is empty before triggering a bootstrap: we always trigger it. Doing so means we receive an OutboundQueryProgressed with empty stats, which tells us that the bootstrap has failed. We can then react to it.

If you agree with this change (removing the check for an empty routing table), then I think we can resolve this conversation: calling self.bootstrap_status.on_started() is correct because a bootstrap is then always actually triggered.

What do you think? Do you agree that there is no need to check for an empty routing table before triggering a bootstrap? (No other kad call does this, by the way.)
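A rough sketch of how an application could detect such a failed bootstrap once the query is always issued. It deliberately avoids the real libp2p event types: `QueryStats`, `Event::BootstrapProgressed`, the toy `Behaviour` and `bootstrap_reached_nobody` are hypothetical, and modelling "one request per known peer" is a simplification of this sketch.

```rust
use std::collections::VecDeque;

/// Hypothetical, reduced query statistics: only the number of requests sent.
#[derive(Debug, Clone, Copy, PartialEq)]
struct QueryStats {
    requests: u32,
}

/// Hypothetical event, loosely modelled on a "query progressed" notification.
#[derive(Debug, PartialEq)]
enum Event {
    BootstrapProgressed { stats: QueryStats },
}

/// Toy behaviour that always starts a bootstrap query, even with no known peers.
struct Behaviour {
    known_peers: u32,
    pending: VecDeque<Event>,
}

impl Behaviour {
    fn new(known_peers: u32) -> Self {
        Behaviour { known_peers, pending: VecDeque::new() }
    }

    /// Always queues a query result. With an empty routing table the query
    /// contacts nobody, so it completes with zero requests.
    fn bootstrap(&mut self) {
        let stats = QueryStats { requests: self.known_peers };
        self.pending.push_back(Event::BootstrapProgressed { stats });
    }

    /// Hands the next pending event to the application, if any.
    fn poll(&mut self) -> Option<Event> {
        self.pending.pop_front()
    }
}

/// Application-side check: a bootstrap that sent no requests reached nobody.
fn bootstrap_reached_nobody(event: &Event) -> bool {
    match event {
        Event::BootstrapProgressed { stats } => stats.requests == 0,
    }
}
```

The point of the design is that failure is reported through the same event path as success, so automatic and periodic bootstraps become observable without the caller ever seeing the NoKnownPeers return value.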

dariusc93 reviewed Apr 30, 2024

protocols/kad/CHANGELOG.md (outdated, resolved)

stormshield-frb force-pushed the fix/automatic-bootstrap-bug branch 2 times, most recently from 67e9aca to ec9898f on May 2, 2024 07:42

guillaumemichel reviewed May 2, 2024

stormshield-frb force-pushed the fix/automatic-bootstrap-bug branch from ec9898f to f2d3e48 on June 5, 2024 10:33

fix(kad): always trigger a query when bootstrapping

f0ee433

stormshield-frb force-pushed the fix/automatic-bootstrap-bug branch from f2d3e48 to f0ee433 on June 5, 2024 10:36

stormshield-frb mentioned this pull request Jun 5, 2024

Kademlia bootstrap gets stuck forever in some cases #5432

Open
