
With low peers (< 10), sync manager may get stuck with workers map full of 'w'/'Q' #5794

Open
etan-status opened this issue Jan 19, 2024 · 7 comments

Comments

@etan-status
Contributor

  1. Each sync worker gets its job: slots 0 ..< 16, 16 ..< 32, 32 ..< 48 and so on
  2. Assume the number of available peers is < SyncWorkersCount - 1 --> some workers will be stuck in w mode
  3. All workers except the first one succeed --> 16 ..< 32, 32 ..< 48 etc. are now in Q mode
  4. The first worker fails --> descores the peer --> retries the task or picks a new one, doesn't matter
  5. It will select the same peer again. If it's a retry, assume the request keeps failing; the peer is descored until it is disconnected, and then the worker is stuck in w mode. If it's a new task, assume it succeeds; then the worker is stuck in Q mode.

After a while, every worker is stuck in w or Q mode, and the 0 ..< 16 request is never fulfilled.

  • w means 'waiting for a peer that is not locked by another sync worker'
  • Q means 'waiting for the parent result to become available before we can process our own result'
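
The two letters above are part of the per-worker status set shown in the sync map. For orientation, a hedged sketch of how the letters map to states (names approximate SyncWorkerStatus in sync_manager.nim; treat the exact identifiers as assumptions):

type
  SyncWorkerStatus = enum
    Sleeping,       # 's' - in sync, nothing to do
    WaitingPeer,    # 'w' - waiting for a peer that no other worker holds
    UpdatingStatus, # 'U' - refreshing the peer's status
    Requesting,     # 'R' - picking a slot range to request
    Downloading,    # 'D' - request in flight
    Queueing,       # 'Q' - waiting for earlier results before ours can be processed
    Processing      # 'P' - the earliest task, currently being processed

func toChar(status: SyncWorkerStatus): char =
  # Letter shown in the sync map for each worker.
  case status
  of Sleeping: 's'
  of WaitingPeer: 'w'
  of UpdatingStatus: 'U'
  of Requesting: 'R'
  of Downloading: 'D'
  of Queueing: 'Q'
  of Processing: 'P'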

Some ideas on possible approaches to tackle this:

  • Should we release peers while in Q mode so that w workers can acquire them? We can still de-score them if they are still around by the time we transition to P, but we no longer exclusively control them (see the sketch after this list).
  • Should we change SyncWorkersCount dynamically based on number of available peers?
  • Should we relax descoring / disconnects while having < SyncWorkersCount peers?
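
A rough sketch of the first idea (hypothetical shape only: BlockList, awaitOurTurn and pushResult are invented placeholders, not the actual queue API):

# Hedged sketch of idea 1: hand the peer back to the pool before blocking
# in Q, so that workers stuck in 'w' can acquire it. `BlockList`,
# `awaitOurTurn` and `pushResult` are placeholders, not the real API.
proc finishTask[A, B](man: SyncManager[A, B], index: int,
                      peer: A, blocks: BlockList) {.async.} =
  man.pool.release(peer)            # peer becomes available to 'w' workers again
  man.workers[index].status = SyncWorkerStatus.Queueing
  await man.queue.awaitOurTurn()    # placeholder: wait until earlier tasks finished
  man.queue.pushResult(blocks)      # placeholder: hand our result to the queue
  # If the result later turns out to be bad, de-scoring must first check
  # whether the peer is still connected, since we no longer hold it.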

This popped up in a tiny devnet with just 10 peers, where some of the peers provided incomplete responses (blocks without blobs within the spec-mandated retention period), leading to gradual descores / disconnects until this situation arose.

@etan-status
Contributor Author

Example:

 peers: 3 ❯ finalized: 07f3e42a:7342 ❯ head: fe749777:7344:8 ❯ time: 15297:2 (244754) ❯ sync: 01d12h44m (48.41%) 0.9467slots/s (QwwwwQwwwQ:117503)/opt

@etan-status
Contributor Author

Regarding the point above:

  • Suppressing descoring / disconnects while having < SyncWorkersCount peers

e.g., by changing:

proc checkPeerScore*[A, B](pool: PeerPool[A, B], peer: A): bool

to not kick peers in such a situation: it appears to work well at first, but it triggers a nasty endless loop after some hours of seemingly being fine.

So, no.
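
For reference, the relaxation that was tried looked roughly like the sketch below (not the actual patch; using len(pool) as the peer count and the final score comparison are assumptions):

# Sketch of the rejected relaxation: never kick peers while we have fewer
# peers than sync workers. Works at first, but enables the busy loop
# described below once a peer's connection is already dead.
proc checkPeerScore*[A, B](pool: PeerPool[A, B], peer: A): bool =
  if len(pool) < SyncWorkersCount:
    return true                        # too few peers: keep it regardless of score
  peer.getScore() >= PeerScoreLowLimit # placeholder for the real threshold check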


The syncing worker runs in this loop:

while true:
    var peer: A = nil
    let doBreak =
      try:
        man.workers[index].status = SyncWorkerStatus.Sleeping
        # This event is going to be set until we are not in sync with network
        await man.notInSyncEvent.wait()
        man.workers[index].status = SyncWorkerStatus.WaitingPeer
        peer = await man.pool.acquire()
        await man.syncStep(index, peer)
        man.pool.release(peer)
        false
      except CancelledError:
        if not(isNil(peer)):
          man.pool.release(peer)
        true
      except CatchableError as exc:
        debug "Unexpected exception in sync worker",
              peer = peer, peer_score = peer.getScore(),
              peer_speed = peer.netKbps(),
              errName = exc.name, errMsg = exc.msg
        true
    if doBreak:
      break

Note how there is no yield point between man.pool.release(peer) and the next await man.pool.acquire():
release synchronously marks the peer as available for acquire, without yielding.
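
The same behaviour can be reproduced in isolation: in chronos, awaiting a future that has already completed does not suspend the caller, so code built only out of such awaits never lets other futures run. A minimal standalone sketch (not sync manager code):

import chronos

proc starved() {.async.} =
  await sleepAsync(1.milliseconds)
  echo "starved() got to run"   # never reached: the program exits first

proc busyLoop() {.async.} =
  asyncSpawn starved()          # runs until its first real suspension point
  for i in 0 ..< 3:
    let fut = newFuture[void]("already-done")
    fut.complete()
    await fut                   # already finished: returns without yielding
  echo "busyLoop finished; starved() never got scheduled again"

waitFor busyLoop()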

Inside syncStep, it then does this:

      # Avoid a stampede of requests, but make them more frequent in case the
      # peer is "close" to the slot range of interest
      if peerStatusAge < StatusExpirationTime div 2:
        await sleepAsync(StatusExpirationTime div 2 - peerStatusAge)

      trace "Updating peer's status information", wall_clock_slot = wallSlot,
            remote_head_slot = peerSlot, local_head_slot = headSlot

      try:
        let res = await peer.updateStatus()
        if not(res):
          peer.updateScore(PeerScoreNoStatus)
          debug "Failed to get remote peer's status, exiting",
                peer_head_slot = peerSlot

          return
      except CatchableError as exc:
        debug "Unexpected exception while updating peer's status",
              peer_head_slot = peerSlot, errName = exc.name, errMsg = exc.msg
        return

If peerStatusAge is already old (>= StatusExpirationTime div 2), the sleepAsync branch is skipped, so the first await does not yield.

And if the peer is internally already disconnected, await peer.updateStatus() fails synchronously, also without yielding.

So: an endless loop in which the worker acquires a peer with a broken connection, tries to update its status, which fails immediately, releases the peer (which is not disconnected, due to the checkPeerScore relaxation), and then immediately acquires it again, without any other futures ever getting a chance to run :)
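
One way to break this particular starvation, independent of how checkPeerScore behaves, would be to force a real suspension point before the worker loops back to acquire, e.g. in the worker loop quoted above (a sketch; where the back-off belongs and how long it should be is a design question):

# Sketch: always pass through a real suspension point before re-acquiring,
# so disconnect handling and other futures can run even when acquire,
# updateStatus and release all complete synchronously.
man.workers[index].status = SyncWorkerStatus.Sleeping
await sleepAsync(100.milliseconds)   # arbitrary back-off; always yields
man.workers[index].status = SyncWorkerStatus.WaitingPeer
peer = await man.pool.acquire()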

Sample log: nimbus_beacon_node_gnosis.txt


Smells a bit like a separate issue: acquire doesn't check whether a peer is disconnecting/disconnected. Either way, it's not a good idea to just patch checkPeerScore.
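
That separate issue could be handled along these lines (a hedged sketch; isAlive is a hypothetical helper, not an actual PeerPool or networking API):

# Sketch: don't hand out peers whose connection is already going away.
# `isAlive` is hypothetical; the real check would live wherever the pool
# learns about disconnects.
proc acquireUsable[A, B](pool: PeerPool[A, B]): Future[A] {.async.} =
  while true:
    let peer = await pool.acquire()
    if peer.isAlive():
      return peer
    pool.release(peer)
    # Without a yield point here this would become the same busy loop.
    await sleepAsync(10.milliseconds)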

@cheatfate
Contributor

There is a problem that the task which should be visible as P has somehow disappeared. When you see Q workers, there MUST be a single instance of P present. If it is not present, we are stuck: the Q workers are waiting for a signal from the queue that they can send their results to it, but the queue will not send this signal until the P task completes. So there is some issue there.
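
Restating the claimed invariant as a check, purely for illustration (the status names follow the sketch earlier in the thread; the iteration over man.workers is an assumption):

# Sketch of the claimed invariant: whenever any worker is in Q, exactly
# one worker must be in P, otherwise the queue never drains the Q workers.
import std/sequtils

func queueInvariantHolds[A, B](man: SyncManager[A, B]): bool =
  let
    anyQueueing = man.workers.anyIt(it.status == SyncWorkerStatus.Queueing)
    processing = man.workers.countIt(it.status == SyncWorkerStatus.Processing)
  not anyQueueing or processing == 1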

@etan-status
Contributor Author

etan-status commented Jan 23, 2024

Only the earliest task can enter P. If it fails in D and there is no unacquired peer with which to retry it (low-peer scenario), it is stuck at w.

Minimal example

  • worker 1 gets task A, worker 2 gets task B, status ww
  • 2 peers connect
  • both tasks acquire a peer, status DD
  • second task succeeds, status DQ, both peers are still acquired
  • first task fails, descores peer and disconnects it, status wQ, the second peer is still around but is acquired by Q
  • if no new peer connects, the first task never completes, and therefore the second task never completes either.

@cheatfate
Contributor

But this is not actually a stuck scenario. I should check it, but even if it happens, as soon as a new peer appears it will continue working.

@etan-status
Contributor Author

Yes, if the network has enough healthy peers it recovers eventually.

This was observed on a tiny devnet with only 10 nodes in total, half of them providing incomplete responses for downloads. In this situation, all healthy peers were acquired by workers in Q. Nimbus was the only client getting stuck in sync on this network.

Agree it's not a high priority fix for real public networks.

@cheatfate
Contributor

I think you should either merge a PR fixing the issue or close the issue :)
