
With low peers (< 10), sync manager may get stuck with workers map full of 'w'/'Q' #5794

Open
etan-status opened this issue Jan 19, 2024 · 7 comments

Comments

@etan-status
Contributor

  1. Each sync worker gets its job: slots 0 ..< 16, 16 ..< 32, 32 ..< 48 and so on
  2. Assume the number of available peers is < SyncWorkersCount - 1 --> some workers will be stuck in w mode
  3. All workers except the first one succeed --> 16 ..< 32, 32 ..< 48 etc. are now in Q mode
  4. The first worker fails --> descores the peer --> retries the task or picks a new one, doesn't matter
  5. It will select the same peer again. If it's a retry, assume the request keeps failing; the peer is descored until it is disconnected, and then the worker is stuck in w mode. If it's a new task, assume it succeeds; then the worker is stuck in Q mode.

After a while, every worker is stuck in w or Q mode, and the 0 ..< 16 request is never fulfilled.

  • w means 'waiting for a peer that is not locked by another sync worker'
  • Q means 'waiting for the parent result to become available before we can process our own result'
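
The two letters above are part of the per-worker status set shown in the sync map. For orientation, a hedged sketch of how the letters map to states (names approximate SyncWorkerStatus in sync_manager.nim; treat the exact identifiers as assumptions):

type
  SyncWorkerStatus = enum
    Sleeping,       # 's' - in sync, nothing to do
    WaitingPeer,    # 'w' - waiting for a peer that no other worker holds
    UpdatingStatus, # 'U' - refreshing the peer's status
    Requesting,     # 'R' - picking a slot range to request
    Downloading,    # 'D' - request in flight
    Queueing,       # 'Q' - waiting for earlier results before ours can be processed
    Processing      # 'P' - the earliest task, currently being processed

func toChar(status: SyncWorkerStatus): char =
  # Letter shown in the sync map for each worker.
  case status
  of Sleeping: 's'
  of WaitingPeer: 'w'
  of UpdatingStatus: 'U'
  of Requesting: 'R'
  of Downloading: 'D'
  of Queueing: 'Q'
  of Processing: 'P'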

Some ideas on possible approaches to tackle this:

  • Should we release peers while in Q mode so that w workers can acquire them? We can still de-score them if they are still around by the time we transition to P, but we no longer exclusively control them (see the sketch after this list).
  • Should we change SyncWorkersCount dynamically based on number of available peers?
  • Should we relax descoring / disconnects while having < SyncWorkersCount peers?
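
A rough sketch of the first idea (hypothetical shape only: BlockList, awaitOurTurn and pushResult are invented placeholders, not the actual queue API):

# Hedged sketch of idea 1: hand the peer back to the pool before blocking
# in Q, so that workers stuck in 'w' can acquire it. `BlockList`,
# `awaitOurTurn` and `pushResult` are placeholders, not the real API.
proc finishTask[A, B](man: SyncManager[A, B], index: int,
                      peer: A, blocks: BlockList) {.async.} =
  man.pool.release(peer)            # peer becomes available to 'w' workers again
  man.workers[index].status = SyncWorkerStatus.Queueing
  await man.queue.awaitOurTurn()    # placeholder: wait until earlier tasks finished
  man.queue.pushResult(blocks)      # placeholder: hand our result to the queue
  # If the result later turns out to be bad, de-scoring must first check
  # whether the peer is still connected, since we no longer hold it.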

This popped up in a tiny devnet with just 10 peers, where some of the peers provided incomplete responses (blocks without blobs within the spec-mandated retention period), leading to gradual descores / disconnects until this situation arose.

@etan-status
Contributor Author

Example:

 peers: 3 ❯ finalized: 07f3e42a:7342 ❯ head: fe749777:7344:8 ❯ time: 15297:2 (244754) ❯ sync: 01d12h44m (48.41%) 0.9467slots/s (QwwwwQwwwQ:117503)/opt

@etan-status
Contributor Author

Regarding the point above:

  • Suppressing descoring / disconnects while having < SyncWorkersCount peers

e.g., by changing:

proc checkPeerScore*[A, B](pool: PeerPool[A, B], peer: A): bool

to not kick peers in such a situation: it appears to work well at first, but it triggers a nasty endless loop after some hours of seemingly being fine.

So, no.
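
For reference, the relaxation that was tried looked roughly like the sketch below (not the actual patch; using len(pool) as the peer count and the final score comparison are assumptions):

# Sketch of the rejected relaxation: never kick peers while we have fewer
# peers than sync workers. Works at first, but enables the busy loop
# described below once a peer's connection is already dead.
proc checkPeerScore*[A, B](pool: PeerPool[A, B], peer: A): bool =
  if len(pool) < SyncWorkersCount:
    return true                        # too few peers: keep it regardless of score
  peer.getScore() >= PeerScoreLowLimit # placeholder for the real threshold check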


The syncing worker runs in this loop:

while true:
    var peer: A = nil
    let doBreak =
      try:
        man.workers[index].status = SyncWorkerStatus.Sleeping
        # This event is going to be set until we are not in sync with network
        await man.notInSyncEvent.wait()
        man.workers[index].status = SyncWorkerStatus.WaitingPeer
        peer = await man.pool.acquire()
        await man.syncStep(index, peer)
        man.pool.release(peer)
        false
      except CancelledError:
        if not(isNil(peer)):
          man.pool.release(peer)
        true
      except CatchableError as exc:
        debug "Unexpected exception in sync worker",
              peer = peer, peer_score = peer.getScore(),
              peer_speed = peer.netKbps(),
              errName = exc.name, errMsg = exc.msg
        true
    if doBreak:
      break

Note how there is no yield point between man.pool.release(peer) and the next await man.pool.acquire():
release synchronously marks the peer as available for acquire, without yielding.
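
The same behaviour can be reproduced in isolation: in chronos, awaiting a future that has already completed does not suspend the caller, so code built only out of such awaits never lets other futures run. A minimal standalone sketch (not sync manager code):

import chronos

proc starved() {.async.} =
  await sleepAsync(1.milliseconds)
  echo "starved() got to run"   # never reached: the program exits first

proc busyLoop() {.async.} =
  asyncSpawn starved()          # runs until its first real suspension point
  for i in 0 ..< 3:
    let fut = newFuture[void]("already-done")
    fut.complete()
    await fut                   # already finished: returns without yielding
  echo "busyLoop finished; starved() never got scheduled again"

waitFor busyLoop()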

Inside syncStep, it then does this:

      # Avoid a stampede of requests, but make them more frequent in case the
      # peer is "close" to the slot range of interest
      if peerStatusAge < StatusExpirationTime div 2:
        await sleepAsync(StatusExpirationTime div 2 - peerStatusAge)

      trace "Updating peer's status information", wall_clock_slot = wallSlot,
            remote_head_slot = peerSlot, local_head_slot = headSlot

      try:
        let res = await peer.updateStatus()
        if not(res):
          peer.updateScore(PeerScoreNoStatus)
          debug "Failed to get remote peer's status, exiting",
                peer_head_slot = peerSlot

          return
      except CatchableError as exc:
        debug "Unexpected exception while updating peer's status",
              peer_head_slot = peerSlot, errName = exc.name, errMsg = exc.msg
        return

If peerStatusAge is already old (>= StatusExpirationTime div 2), the sleepAsync branch is skipped, so the first await does not yield.

And if the peer is internally already disconnected, await peer.updateStatus() fails synchronously, also without yielding.

So: an endless loop in which the worker acquires a peer with a broken connection, tries to update its status, which fails immediately, releases the peer (which is not disconnected, due to the checkPeerScore relaxation), and then immediately acquires it again, without any other futures ever getting a chance to run :)
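
One way to break this particular starvation, independent of how checkPeerScore behaves, would be to force a real suspension point before the worker loops back to acquire, e.g. in the worker loop quoted above (a sketch; where the back-off belongs and how long it should be is a design question):

# Sketch: always pass through a real suspension point before re-acquiring,
# so disconnect handling and other futures can run even when acquire,
# updateStatus and release all complete synchronously.
man.workers[index].status = SyncWorkerStatus.Sleeping
await sleepAsync(100.milliseconds)   # arbitrary back-off; always yields
man.workers[index].status = SyncWorkerStatus.WaitingPeer
peer = await man.pool.acquire()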

Sample log: nimbus_beacon_node_gnosis.txt


Smells a bit like a separate issue: acquire doesn't check whether a peer is disconnecting/disconnected. Either way, it's not a good idea to just patch checkPeerScore.
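
That separate issue could be handled along these lines (a hedged sketch; isAlive is a hypothetical helper, not an actual PeerPool or networking API):

# Sketch: don't hand out peers whose connection is already going away.
# `isAlive` is hypothetical; the real check would live wherever the pool
# learns about disconnects.
proc acquireUsable[A, B](pool: PeerPool[A, B]): Future[A] {.async.} =
  while true:
    let peer = await pool.acquire()
    if peer.isAlive():
      return peer
    pool.release(peer)
    # Without a yield point here this would become the same busy loop.
    await sleepAsync(10.milliseconds)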

@cheatfate
Contributor

There is a problem that the task which should be visible as P has somehow disappeared. When you see Q workers, there MUST be a single instance of P present. If it is not present, we are stuck: the Q workers are waiting for a signal from the queue that they can send their results to it, but the queue will not send this signal until the P task completes. So there is some issue there.
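
Restating the claimed invariant as a check, purely for illustration (the status names follow the sketch earlier in the thread; the iteration over man.workers is an assumption):

# Sketch of the claimed invariant: whenever any worker is in Q, exactly
# one worker must be in P, otherwise the queue never drains the Q workers.
import std/sequtils

func queueInvariantHolds[A, B](man: SyncManager[A, B]): bool =
  let
    anyQueueing = man.workers.anyIt(it.status == SyncWorkerStatus.Queueing)
    processing = man.workers.countIt(it.status == SyncWorkerStatus.Processing)
  not anyQueueing or processing == 1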

@etan-status
Contributor Author

etan-status commented Jan 23, 2024

Only the earliest task can enter P. If it fails in D and there is no unacquired peer with which to retry it (low-peer scenario), it is stuck at w.

Minimal example

  • worker 1 gets task A, worker 2 gets task B, status ww
  • 2 peers connect
  • both tasks acquire a peer, status DD
  • second task succeeds, status DQ, both peers are still acquired
  • first task fails, descores peer and disconnects it, status wQ, the second peer is still around but is acquired by Q
  • if no new peer connects, the first task never completes, and therefore the second task never completes either.

@cheatfate
Contributor

But this is not actually a stuck scenario. I should check it, but even if it happens, as soon as a new peer appears it will continue working.

@etan-status
Contributor Author

Yes, if the network has enough healthy peers it recovers eventually.

This was observed on a tiny devnet with only 10 nodes in total, half of them providing incomplete responses for downloads. In this situation, all healthy peers were acquired by workers in Q. Nimbus was the only client getting stuck in sync on this network.

Agree it's not a high priority fix for real public networks.

@cheatfate
Contributor

I think you should either merge a PR fixing the issue or close the issue :)
