This repository has been archived by the owner on Feb 15, 2024. It is now read-only.

Stuck Verifier #46

Open
EggPool opened this issue Apr 13, 2022 · 24 comments

EggPool commented Apr 13, 2022

Seems like a different issue: no red outline on nyzo.co.

Verifier was showing as yellow with a tracking issue. Stuck at block 16148619, running v611002.

^^^^^^^^^^^^^^^^^^^^^ casting vote for height 16148619
broadcasting message: BlockVote19 to 2575
trying to fetch BlockRequest11 from SpaceX
trying to fetch MissingBlockVoteRequest23 from CITIZEN12
trying to fetch MissingBlockVoteRequest23 from S-Dedicated-2
trying to fetch MissingBlockVoteRequest23 from Yearbooks
trying to fetch MissingBlockVoteRequest23 from guru
trying to fetch MissingBlockVoteRequest23 from ProMac11
trying to fetch MissingBlockVoteRequest23 from iPhone
trying to fetch MissingBlockVoteRequest23 from KOyou 9
trying to fetch MissingBlockVoteRequest23 from Wendell
trying to fetch MissingBlockVoteRequest23 from Dailreport5
trying to fetch MissingBlockVoteRequest23 from MAMBA1
requesting block with votes for height 16148619, interval=4662
trying to fetch BlockWithVotesRequest37 from gunray5
nodejoin_from 85.244.93.155 2756...c1b5 joelpandrade
trying to fetch BlockRequest11 from ff
nodejoin_from 51.89.123.85 4c3e...0274 barbora
trying to fetch BlockRequest11 from CCC 19
trying to fetch BlockRequest11 from ff
trying to fetch MissingBlockVoteRequest23 from Do_You_Try3
trying to fetch MissingBlockVoteRequest23 from gateway011
trying to fetch MissingBlockVoteRequest23 from Togepi
trying to fetch MissingBlockVoteRequest23 from DoYouW1
trying to fetch MissingBlockVoteRequest23 from Signcycle 16
trying to fetch MissingBlockVoteRequest23 from CrazyB01
trying to fetch MissingBlockVoteRequest23 from Glasses
trying to fetch MissingBlockVoteRequest23 from MAMBA1
trying to fetch MissingBlockVoteRequest23 from barbora28Fjn
trying to fetch MissingBlockVoteRequest23 from mCAFE 20
requesting block with votes for height 16148619, interval=4065
trying to fetch BlockWithVotesRequest37 from Alex Senemar 15
waiting for message queue to clear from thread [Verifier-mainLoop], size is 1
^^^^^^^^^^^^^^^^^^^^^ casting vote for height 16148619
broadcasting message: BlockVote19 to 2575
trying to fetch BlockRequest11 from GramT22
top verifier be87...d9df has 2394 votes with a cycle length of 2545 (94.1%)
added new out-of-cycle node to queue: 16d9...4a64
nodejoin_from 158.69.33.13 16d9...4a64 noway78
trying to fetch BlockRequest11 from Res0ec21
added new out-of-cycle node to queue: ef33...95b3
nodejoin_from 51.161.50.73 ef33...95b3 bmw3251010
trying to fetch BlockRequest11 from LASNA 22
trying to fetch MissingBlockVoteRequest23 from Foxwie7
trying to fetch MissingBlockVoteRequest23 from Cory
trying to fetch MissingBlockVoteRequest23 from ovyzo11-7
trying to fetch MissingBlockVoteRequest23 from Nyzo rules
trying to fetch MissingBlockVoteRequest23 from VUL1RT
trying to fetch MissingBlockVoteRequest23 from Alita Angel
trying to fetch MissingBlockVoteRequest23 from bull0002
trying to fetch MissingBlockVoteRequest23 from Gaara988
trying to fetch MissingBlockVoteRequest23 from Lucky24111
trying to fetch MissingBlockVoteRequest23 from Mopsu01
requesting block with votes for height 16148619, interval=4044
trying to fetch BlockWithVotesRequest37 from Andy
trying to fetch BlockRequest11 from Clients
added new out-of-cycle node to queue: 5c1e...7186
nodejoin_from 51.89.123.105 5c1e...7186 barbora
^^^^^^^^^^^^^^^^^^^^^ casting vote for height 16148619
broadcasting message: BlockVote19 to 2575
trying to fetch BlockRequest11 from eNagaina5346
top verifier be87...d9df has 2398 votes with a cycle length of 2545 (94.2%)
trying to fetch MissingBlockVoteRequest23 from mototoya000
trying to fetch MissingBlockVoteRequest23 from Yuge_Chungus
trying to fetch MissingBlockVoteRequest23 from rumba6
trying to fetch MissingBlockVoteRequest23 from TOBE19
trying to fetch MissingBlockVoteRequest23 from Gogirl A19
trying to fetch MissingBlockVoteRequest23 from HELLO63
trying to fetch MissingBlockVoteRequest23 from Sakura31356
trying to fetch MissingBlockVoteRequest23 from Keep18840
trying to fetch MissingBlockVoteRequest23 from Artboy11
trying to fetch MissingBlockVoteRequest23 from noway74
requesting block with votes for height 16148619, interval=4612
trying to fetch BlockWithVotesRequest37 from pvip150
nodejoin_from 3.232.136.94 0991...c1d0 Alphabet41
waiting for message queue to clear from thread [Verifier-mainLoop], size is 1
nodejoin_from 54.38.34.75 e7cd...a411 Nescafe15
trying to fetch BlockRequest11 from Hilarious50492
nodejoin_from 135.125.190.4 3faf...7416 Alice
trying to fetch BlockRequest11 from aaa004
^^^^^^^^^^^^^^^^^^^^^ casting vote for height 16148619
broadcasting message: BlockVote19 to 2575
trying to fetch BlockRequest11 from Obama6
trying to fetch MissingBlockVoteRequest23 from mrle310
trying to fetch MissingBlockVoteRequest23 from *N*pass15763
trying to fetch MissingBlockVoteRequest23 from Gogirl A19
trying to fetch MissingBlockVoteRequest23 from MCR2023
trying to fetch MissingBlockVoteRequest23 from Binance
trying to fetch MissingBlockVoteRequest23 from Sam_FTX 11
trying to fetch MissingBlockVoteRequest23 from Blockcollider14
trying to fetch MissingBlockVoteRequest23 from flower 11
trying to fetch MissingBlockVoteRequest23 from Input14
trying to fetch MissingBlockVoteRequest23 from BelongB16
requesting block with votes for height 16148619, interval=4785
trying to fetch BlockWithVotesRequest37 from mototoya008
trying to fetch BlockRequest11 from Fat Man
trying to fetch BlockRequest11 from Charmander
top verifier be87...d9df has 2397 votes with a cycle length of 2545 (94.2%)
added new out-of-cycle node to queue: f72d...892c
removed node from new out-of-cycle queue due to size
nodejoin_from 51.161.60.76 f72d...892c pvip108
nodejoin_from 18.191.26.157 29d8...e55e PigCash40
^^^^^^^^^^^^^^^^^^^^^ casting vote for height 16148619
broadcasting message: BlockVote19 to 2575
nodejoin_from 116.203.33.209 3aa8...f753 OhHaiNyzo-1
trying to fetch BlockRequest11 from bull0002
trying to fetch MissingBlockVoteRequest23 from Belinda
trying to fetch MissingBlockVoteRequest23 from Lacia05
trying to fetch MissingBlockVoteRequest23 from f5
trying to fetch MissingBlockVoteRequest23 from Fushimi Inari
trying to fetch MissingBlockVoteRequest23 from elasticsearch012
trying to fetch MissingBlockVoteRequest23 from Y20Y21
trying to fetch MissingBlockVoteRequest23 from IIHLOHA11
trying to fetch MissingBlockVoteRequest23 from Giacomo20
trying to fetch MissingBlockVoteRequest23 from SBLY3
trying to fetch MissingBlockVoteRequest23 from snjor139
requesting block with votes for height 16148619, interval=4169
trying to fetch BlockWithVotesRequest37 from rumba12
waiting for message queue to clear from thread [Verifier-mainLoop], size is 2
nodejoin_from 52.214.208.238 26b8...eb3a Forbole2
trying to fetch BlockRequest11 from lucyLuky15
trying to fetch BlockRequest11 from Fireplug
added new out-of-cycle node to queue: 1005...b6d0
nodejoin_from 142.44.145.117 1005...b6d0 automotives85
trying to fetch BlockRequest11 from Shameful41611
trying to fetch MissingBlockVoteRequest23 from apk
trying to fetch MissingBlockVoteRequest23 from Mask Jame 3
trying to fetch MissingBlockVoteRequest23 from Only02
trying to fetch MissingBlockVoteRequest23 from NRV-SC17
trying to fetch MissingBlockVoteRequest23 from Omyback28201
trying to fetch MissingBlockVoteRequest23 from ajmo1
trying to fetch MissingBlockVoteRequest23 from rFirst
trying to fetch MissingBlockVoteRequest23 from Nash01
trying to fetch MissingBlockVoteRequest23 from rmb
trying to fetch MissingBlockVoteRequest23 from Sam_FTX 8
requesting block with votes for height 16148619, interval=4221
trying to fetch BlockWithVotesRequest37 from Arionum04
waiting for message queue to clear from thread [Verifier-mainLoop], size is 3
^^^^^^^^^^^^^^^^^^^^^ casting vote for height 16148619
broadcasting message: BlockVote19 to 2575
trying to fetch BlockRequest11 from CPRCP7
top verifier be87...d9df has 2397 votes with a cycle length of 2545 (94.2%)
added new out-of-cycle node to queue: ff30...8af8
removed node from new out-of-cycle queue due to size
nodejoin_from 51.161.28.69 ff30...8af8 burr170
trying to fetch BlockRequest11 from RICKY37
added new out-of-cycle node to queue: 9fd7...141b
removed node from new out-of-cycle queue due to size
nodejoin_from 51.89.123.109 9fd7...141b barbora
trying to fetch BlockRequest11 from Elon Musk
added new out-of-cycle node to queue: 846e...2293
nodejoin_from 144.217.196.57 846e...2293 noway58
trying to fetch MissingBlockVoteRequest23 from A1200O4898
trying to fetch MissingBlockVoteRequest23 from boogie315208
trying to fetch MissingBlockVoteRequest23 from CITIZEN1
trying to fetch MissingBlockVoteRequest23 from wdc-v1
trying to fetch MissingBlockVoteRequest23 from Jack-LA14
trying to fetch MissingBlockVoteRequest23 from Getting_12
trying to fetch MissingBlockVoteRequest23 from FBG26505
trying to fetch MissingBlockVoteRequest23 from 3359
trying to fetch MissingBlockVoteRequest23 from win10
trying to fetch MissingBlockVoteRequest23 from 11Ncode14659
requesting block with votes for height 16148619, interval=4295
trying to fetch BlockWithVotesRequest37 from Randolph Scott 3
waiting for message queue to clear from thread [Verifier-mainLoop], size is 1
trying to fetch BlockRequest11 from NvGirl14435
^^^^^^^^^^^^^^^^^^^^^ casting vote for height 16148619
broadcasting message: BlockVote19 to 2575
nodejoin_from 137.184.235.65 d774...ad5c barrick
trying to fetch BlockRequest11 from KingSuper16
added new out-of-cycle node to queue: 120b...8434
nodejoin_from 66.70.130.67 120b...8434 DeadKHBEf
trying to fetch BlockRequest11 from sm1gunray8
top verifier be87...d9df has 2397 votes with a cycle length of 2545 (94.2%)
trying to fetch MissingBlockVoteRequest23 from IOI-Again
trying to fetch MissingBlockVoteRequest23 from Harmony11
trying to fetch MissingBlockVoteRequest23 from LASNA 83
trying to fetch MissingBlockVoteRequest23 from CARplay18
trying to fetch MissingBlockVoteRequest23 from Cinco de Cuatro
trying to fetch MissingBlockVoteRequest23 from evropovkirilovfilipas
trying to fetch MissingBlockVoteRequest23 from MOyAde4
trying to fetch MissingBlockVoteRequest23 from gunray5
trying to fetch MissingBlockVoteRequest23 from inco052
trying to fetch MissingBlockVoteRequest23 from xOlife6342
requesting block with votes for height 16148619, interval=4085
trying to fetch BlockWithVotesRequest37 from Hard Time
nodejoin_from 46.137.39.75 23dd...765e Polychain3
nodejoin_from 145.239.82.153 8c18...6a7f DEALER12
waiting for message queue to clear from thread [Verifier-mainLoop], size is 3
trying to fetch BlockRequest11 from Hulama19
trying to fetch BlockRequest11 from aq5293
added 51.15.86.249 to dynamic whitelist
^^^^^^^^^^^^^^^^^^^^^ casting vote for height 16148619
broadcasting message: BlockVote19 to 2575
added new out-of-cycle node to queue: ba1f...1a05
nodejoin_from 144.217.204.29 ba1f...1a05 Nutka130
added new out-of-cycle node to queue: 7ed5...cafd
nodejoin_from 51.89.89.7 7ed5...cafd gogo57411
added new out-of-cycle node to queue: 6845...2099
removed node from new out-of-cycle queue due to size
nodejoin_from 51.161.50.77 6845...2099 bmw3251014
added new out-of-cycle node to queue: e1cb...3f0f
removed node from new out-of-cycle queue due to size
nodejoin_from 158.69.33.17 e1cb...3f0f noway82
trying to fetch BlockRequest11 from Armored1595
trying to fetch MissingBlockVoteRequest23 from Nash03
trying to fetch MissingBlockVoteRequest23 from log002
trying to fetch MissingBlockVoteRequest23 from BURGERKING14
trying to fetch MissingBlockVoteRequest23 from videoMAC11
trying to fetch MissingBlockVoteRequest23 from MAMBA3
trying to fetch MissingBlockVoteRequest23 from 10Dollar13
trying to fetch MissingBlockVoteRequest23 from NvGirl14435
trying to fetch MissingBlockVoteRequest23 from Cool01
trying to fetch MissingBlockVoteRequest23 from EasyD3
trying to fetch MissingBlockVoteRequest23 from z0rn-rac-11453
requesting block with votes for height 16148619, interval=6145
trying to fetch BlockWithVotesRequest37 from Richkey13
nodejoin_from 3.97.146.179 6ccf...39e6 KalpaTech18
added new out-of-cycle node to queue: b104...787e
removed node from new out-of-cycle queue due to size
nodejoin_from 54.213.210.50 b104...787e Coverlet23
waiting for message queue to clear from thread [Verifier-mainLoop], size is 3
trying to fetch BlockRequest11 from but_ton22
top verifier be87...d9df has 2397 votes with a cycle length of 2545 (94.2%)
nodejoin_from 94.23.161.249 c037...3448 corr03
^^^^^^^^^^^^^^^^^^^^^ casting vote for height 16148619
broadcasting message: BlockVote19 to 2575

and so on...

When in this state, it won't move by itself; it needs a blocks resync and a restart.
Also note the "removed node from new out-of-cycle queue due to size" lines.

n-y-z-o (Owner) commented Apr 13, 2022

If there's no red outline, it is almost certainly (~99.998% chance) a different issue. Also, the area around that block was rather uneventful.

  • 16148617: verified 1.799 seconds after end
  • 16148618: verified 1.583 seconds after end
  • 16148619: verified 1.706 seconds after end
  • 16148620: verified 1.553 seconds after end
  • 16148621: verified 1.711 seconds after end

Due to delays on vote flipping, we would never see such close times if there was any question about which block would be frozen.

Looking at your log, it appears that a temporary communication issue caused the verifier to miss votes for a block. We had a similar issue when one of our Digital Ocean droplets (their term for a VPS) went down last week, but as soon as the droplet came back to life, the verifier was able to sync back up.

The "removed node from new out-of-cycle queue due to size" message is not a problem. This map is kept small (1000) to protect the verifier, and the only consequence of this removal is that some new verifiers will be temporarily delayed in their eligibility for the lottery.

So, a few interesting questions here are:

  • Do you have an earlier error in the log file that would indicate why the verifier wasn't able to sync back up? An out-of-memory exception would be especially interesting.
  • Do you have a copy of the verifier status? This could be either text or screenshot. The part that shows unfrozen blocks and their votes is especially interesting for stuck (yellow) verifiers.
  • What do you mean precisely by "When in this state, it won't move by itself, needs blocks resync and restart"? Is this just a restart and the automatic resync process with the HistoricalChainFiller, or do you have another process that you run? In the past, we ran an SCP process to quickly resync verifiers that were lagging, but we haven't had to do that since the HistoricalChainFiller improvements in version 586 (https://tech.nyzo.co/releaseNotes/v586).
  • What are the specs of the VPS? This is especially interesting if you see an out-of-memory exception earlier in your log, and it could help us provide recommendations for some preferences.
  • Was the IP of this verifier either 167.71.130.144 or 157.230.130.22 at the time?

Side note: I had mentioned that we tend to assume honest misconfiguration, not malice, when we see high bandwidth usage from out-of-cycle verifiers. This was due to some earlier bandwidth issues we saw from some misconfigured out-of-cycle verifiers. I had forgotten the amount of bandwidth, though, and I had forgotten that the release notes for v586 noted it. At the time, we were consuming over 14TB of traffic per day to serve the seed transaction files, and this traffic was actually generated by a rather small group of out-of-cycle verifiers (about 50, I think).

EggPool (Author) commented Apr 14, 2022

Thanks for looking into it.

  • I don't have the previous logs anymore. Will try to save them next time this happens, if I catch it soon enough.

  • Status when it was yellow:

IP address: 
last queried: 107.1 seconds ago
nickname: 
version: 611002
ID: 
mesh: 5624 total, 2523 in cycle
cycle length: 2545
transactions: 0
retention edge: 16138412
trailing edge: 16138436
frozen edge: 16148618 (Keep18840/13...a0)
open edge: 16148816
blocks transmitted/created: 40/99177
block vote: insufficient count=1460; h=16148619; 36a3...41e9
last join height: 16148187
last removal height: 16148469
receiving UDP: yes
- h: +1, n: 3, v: 1464(1464)
- h: +2, n: 0, v: 0(0)
- h: +3, n: 0, v: 0(0)
...
- h: +195, n: 0, v: 2520(2520)
- h: +196, n: 0, v: 2520(2520)
- h: +197, n: 0, v: 2519(2519)
  • "needs blocks resync and restart": My usual process to quickly restart stuck verifiers, when they are yellow and stuck at the same block, is to rsync the "individual" block folder from a tracking one and restart. Restarting alone - in previous similar cases, did not try there - either was not helping, or only moved up a few blocks then stalled again.

  • Not a VPS but a dedicated machine: 8 cores, 32 GB RAM. RAM was not exhausted at that time, but there was a CPU spike before and while it froze. Under regular circumstances, RAM and CPU usage are both roughly at 50%.

  • IP either 167.71.130.144 or 157.230.130.22 at the time?
    Neither of these.

I'm not a Java pro, but if OOM errors can play havoc, or some background threads can stop under some circumstances (I saw that while debugging the sentinel, IIRC: some threads stopped because of an unlogged error, leaving the whole thing in a partially working state), couldn't the process "just" log and stop when this happens, so it can be restarted cleanly by supervisord?
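
For reference, a minimal supervisord program section that would restart the verifier automatically whenever the process exits; the program name, script path, directory, and log path are placeholders for a typical setup, not the actual deployment:

[program:nyzo-verifier]
; hypothetical wrapper script that launches the verifier with your usual java command
command=/bin/bash /home/ubuntu/start-nyzo-verifier.sh
directory=/home/ubuntu/nyzoVerifier
autorestart=true
startretries=1000000
stopasgroup=true
stderr_logfile=/var/log/nyzo-verifier.err.log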

n-y-z-o (Owner) commented Apr 14, 2022

Your hardware is definitely not a problem. :)

The idea of logging/dying is really good. It's certainly preferable to getting stuck in a non-working state, and it might also help us identify and fix issues faster. This could probably be implemented robustly in the current design with checkpoints in all long-running threads and a watchdog that does a state dump and termination if any checkpoint is untouched for a specified interval.
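
A minimal, hypothetical Java sketch of that checkpoint/watchdog idea (none of these names come from the Nyzo codebase, and the 5-minute limit is an arbitrary assumption): each long-running thread calls checkpoint() from its main loop, and a single monitor thread dumps all stack traces and exits if any registered thread goes silent, so a supervisor can restart the process cleanly.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class Watchdog {

    // Assumption for illustration: a thread is considered stuck after 5 silent minutes.
    private static final long maximumSilentIntervalMillis = 5L * 60L * 1000L;
    private static final Map<String, Long> lastCheckpointTimes = new ConcurrentHashMap<>();

    // Called periodically by each long-running thread (e.g. from its main loop).
    public static void checkpoint(String threadName) {
        lastCheckpointTimes.put(threadName, System.currentTimeMillis());
    }

    // Starts the monitor thread that enforces the interval.
    public static void start() {
        Thread monitor = new Thread(() -> {
            while (true) {
                long now = System.currentTimeMillis();
                for (Map.Entry<String, Long> entry : lastCheckpointTimes.entrySet()) {
                    if (now - entry.getValue() > maximumSilentIntervalMillis) {
                        System.err.println("watchdog: thread " + entry.getKey() + " silent for " +
                                (now - entry.getValue()) + " ms; dumping state and exiting");
                        // State dump: print a stack trace for every live thread.
                        Thread.getAllStackTraces().forEach((thread, stack) -> {
                            System.err.println("thread " + thread.getName());
                            for (StackTraceElement element : stack) {
                                System.err.println("  " + element);
                            }
                        });
                        // Exit so a supervisor (supervisord, systemd, ...) restarts the process.
                        System.exit(1);
                    }
                }
                try {
                    Thread.sleep(10000L);
                } catch (InterruptedException ignored) {
                }
            }
        }, "watchdog");
        monitor.setDaemon(true);
        monitor.start();
    }
}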

Thank you for the detailed info. We'll dig a little more and let you know what we find.

EggPool (Author) commented Apr 19, 2022

Another verifier, likely the same issue.
VPS with 6 cores, 16 GB RAM.
CPU load was at 6 out of 6, down to 1.3 after rsync and restart, so the verifier was eating all CPU while stuck.
No RAM issue.

Verifier stuck at 16216601

IP address: 
last queried: 2.3 seconds ago
nickname: 
version: 610
ID: 
mesh: 4697 total, 2521 in cycle
cycle length: 2538
transactions: 0
retention edge: 16206422
trailing edge: 16206446
frozen edge: 16216600 (Protection/25...37)
open edge: 16227831
blocks transmitted/created: 21/50309
block vote: insufficient count=1812; h=16216601; 1b36...fbff
last join height: 16214747
last removal height: 16215740
receiving UDP: yes
- h: +1, n: 2, v: 1812(1812)
- h: +2, n: 0, v: 1646(1646)
- h: +3, n: 0, v: 1725(1725)
...
- h: +11229, n: 0, v: 0(0)
- h: +11230, n: 0, v: 0(0)
- h: +11231, n: 0, v: 0(0)

logs when stuck:

requesting block with votes for height 16216601, interval=70174
Exception sending message MissingBlockVoteRequest23 to 146.59.44.58:9444: exception: Connection reset; java.net.SocketOutputStream:115, java.net.SocketOutputStream:143, co.nyzo.verifier.Message$1:231, java.lang.Thread:748
trying to fetch BlockWithVotesRequest37 from Protection
nodejoin_from 104.200.136.49 b52e...1879 uc
...
nodejoin_from 149.56.80.201 7d7c...889b Irresistible45291
nodejoin_from 34.243.126.189 c97e...3add Mercury14
added new out-of-cycle node to queue: 0da1...93dc
removed node from new out-of-cycle queue due to size
nodejoin_from 3.248.221.152 0da1...93dc StakeFish7
nodejoin_from 135.181.105.116 d03d...aba6 Corsa
nodejoin_from 141.94.221.120 cff1...5220 MiniStand4
added new out-of-cycle node to queue: 24ea...f49c
removed node from new out-of-cycle queue due to size
nodejoin_from 18.119.150.92 24ea...f49c BartestneT16
nodejoin_from 54.171.145.228 96b9...7dd8 ushakov12
nodejoin_from 44.239.7.94 7dc7...3fb6 Coverlet9
added new out-of-cycle node to queue: ad10...a435
removed node from new out-of-cycle queue due to size
nodejoin_from 173.208.164.90 8906...b295 CAZ
added new out-of-cycle node to queue: 9813...0b55
nodejoin_from 192.99.132.226 ad10...a435 Frail6151
nodejoin_from 51.161.60.59 9813...0b55 pvip091
nodejoin_from 94.23.147.154 f6cc...551e corr13
added new out-of-cycle node to queue: 89bf...2818
removed node from new out-of-cycle queue due to size
...
removed node from new out-of-cycle queue due to size
nodejoin_from 144.168.59.68 399d...b366 biiNG3
trying to fetch BlockRequest11 from DO20YR83
Top verifier is null. This is likely due to vote scattering, a transition period between voting windows, or recent restart of this verifier. If this continues for more than 50 blocks, please check your verifier configuration.
trying to fetch MissingBlockVoteRequest23 from qq
trying to fetch MissingBlockVoteRequest23 from Fiery3977
Exception sending message MissingBlockVoteRequest23 to 104.200.136.208:9444: exception: Connection reset; java.net.SocketOutputStream:115, java.net.SocketOutputStream:143, co.nyzo.verifier.Message$1:231, java.lang.Thread:748
Exception sending message MissingBlockVoteRequest23 to 167.99.191.82:9444: exception: Connection reset; java.net.SocketOutputStream:115, java.net.SocketOutputStream:143, co.nyzo.verifier.Message$1:231, java.lang.Thread:748
trying to fetch MissingBlockVoteRequest23 from Digital3
trying to fetch MissingBlockVoteRequest23 from bbda14
nodejoin_from 95.217.196.211 633e...80f5 Jacquelyn
trying to fetch MissingBlockVoteRequest23 from Only02
trying to fetch MissingBlockVoteRequest23 from DimiO77
trying to fetch MissingBlockVoteRequest23 from WZmay
trying to fetch MissingBlockVoteRequest23 from Jalali 28
trying to fetch MissingBlockVoteRequest23 from So cute11
trying to fetch MissingBlockVoteRequest23 from FALOWAS22
requesting block with votes for height 16216601, interval=63608
trying to fetch BlockWithVotesRequest37 from Poladya17853
waiting for message queue to clear from thread [Verifier-mainLoop], size is 7
^^^^^^^^^^^^^^^^^^^^^ casting vote for height 16216601

and so on.

Rewound the logs to find the moment it went from tracking OK to stuck:

nodejoin_from 51.161.65.126 923d...90b4 terra305045
nodejoin_from 5.161.87.65 4a62...3ae4 wdc-hzvps-9526
freezing block [Block: v=2, height=16216600, hash=254c...5e37, id=78e7...a3cf] with standard mechanism
after registering block [Block: v=2, height=16216600, hash=254c...5e37, id=78e7...a3cf] in BlockchainMetricsManager(), cycleTransactionSum=∩0.000056, cycleLength=2538
BlockVoteManager: removing vote map of size 1906
trying to fetch BlockRequest11 from SecBay61
fetching block 16216601 (1b36...fbff) from mesh on ...
trying to fetch MissingBlockRequest25 from csn
Top verifier is null. This is likely due to vote scattering, a transition period between voting windows, or recent restart of this verifier. If this continues for more than 50 blocks, pl>
nodejoin_from 18.200.151.187 1d77...384b 0xFury43
Exception sending message MissingBlockRequest25 to 138.197.178.34:9444: exception: Connection reset; java.net.SocketOutputStream:115, java.net.SocketOutputStream:143, co.nyzo.verifier.Message:231, java.lang.Thread:748

...
removed node from new out-of-cycle queue due to size
nodejoin_from 51.89.28.124 f345...ad74 avalon677855
nodejoin_from 49.12.8.145 c5a8...9a3c 10Dollar7
nodejoin_from 65.21.154.9 8be2...b8ec ProMac11
cleaning up because block 16216600 was frozen
sending verifier-removal vote to node at IP: 51.77.75.168
added new out-of-cycle node to queue: a623...80ff
removed node from new out-of-cycle queue due to size
nodejoin_from 18.221.95.101 a623...80ff Natalia37
nodejoin_from 95.217.130.145 86e3...f6fd snjor139
added new out-of-cycle node to queue: aa16...d173
removed node from new out-of-cycle queue due to size
nodejoin_from 5.161.87.195 aa16...d173 wdc-hzvps-32537
nodejoin_from 51.89.167.99 9610...c1c0 Sunday
added new out-of-cycle node to queue: d74e...a0f5
removed node from new out-of-cycle queue due to size
...
added new out-of-cycle node to queue: feb7...63f8
removed node from new out-of-cycle queue due to size
added new out-of-cycle node to queue: 7da6...2532
removed node from new out-of-cycle queue due to size
nodejoin_from 66.70.236.54 feb7...63f8 imdb623
nodejoin_from 51.89.47.0 7da6...2532 europe99767
nodejoin_from 137.74.41.14 38e9...c9b6 corr01
added new out-of-cycle node to queue: e45f...7042
removed node from new out-of-cycle queue due to size
....
trying to fetch BlockRequest11 from Dida15
Exception sending message BlockRequest11 to 78.46.204.141:9444: exception: Connection reset; java.net.SocketOutputStream:115, java.net.SocketOutputStream:143, co.nyzo.verifier.Message$1:>
added new out-of-cycle node to queue: dc8a...9f8d
removed node from new out-of-cycle queue due to size
...
Top verifier is null. This is likely due to vote scattering, a transition period between voting windows, or recent restart of this verifier. If this continues for more than 50 blocks, pl>
trying to fetch MissingBlockVoteRequest23 from Urine_Trees
trying to fetch MissingBlockVoteRequest23 from ZBank20
trying to fetch MissingBlockVoteRequest23 from Zzzwww3
trying to fetch MissingBlockVoteRequest23 from FBG26505
trying to fetch MissingBlockVoteRequest23 from NV2080
trying to fetch MissingBlockVoteRequest23 from gogo48466
trying to fetch MissingBlockVoteRequest23 from DimiO93
trying to fetch MissingBlockVoteRequest23 from Foxwie8
trying to fetch MissingBlockVoteRequest23 from 792a...53b2
trying to fetch MissingBlockVoteRequest23 from bmw3251043
requesting block with votes for height 16216601, interval=130416
trying to fetch BlockWithVotesRequest37 from LASNA 51
waiting for message queue to clear from thread [Verifier-mainLoop], size is 4
Exception sending message MissingBlockVoteRequest23 to 51.79.243.15:9444: exception: Connection reset; java.net.SocketOutputStream:115, java.net.SocketOutputStream:143, co.nyzo.verifier.>
added new out-of-cycle node to queue: 897d...cf15
removed node from new out-of-cycle queue due to size
nodejoin_from 54.211.242.83 897d...cf15 Alphabet47
^^^^^^^^^^^^^^^^^^^^^ casting vote for height 16216601
broadcasting message: BlockVote19 to 2563
...
added new out-of-cycle node to queue: cb0a...69a9
removed node from new out-of-cycle queue due to size
nodejoin_from 35.183.49.83 cb0a...69a9 Natalia36
nodejoin_from 69.174.96.179 5b91...59e5 apk
added new out-of-cycle node to queue: 4993...ebfd
removed node from new out-of-cycle queue due to size
nodejoin_from 51.89.28.125 4993...ebfd avalon136829
nodejoin_from 51.77.162.147 88b8...200d bachus3
nodejoin_from 3.130.222.111 1361...f63b Messi45
nodejoin_from 5.161.85.184 9d50...fefe wdc-hzvps-18649
added new out-of-cycle node to queue: c6c9...fc67
removed node from new out-of-cycle queue due to size
...
trying to fetch BlockRequest11 from Miss You16
Top verifier is null. This is likely due to vote scattering, a transition period between voting windows, or recent restart of this verifier. If this continues for more than 50 blocks, pl>
trying to fetch MissingBlockVoteRequest23 from z0rn-6151
trying to fetch MissingBlockVoteRequest23 from Marry 21
trying to fetch MissingBlockVoteRequest23 from Baby2
trying to fetch MissingBlockVoteRequest23 from loki
trying to fetch MissingBlockVoteRequest23 from 3359
trying to fetch MissingBlockVoteRequest23 from CITIZEN12
trying to fetch MissingBlockVoteRequest23 from Input14
trying to fetch MissingBlockVoteRequest23 from idHLyes20
trying to fetch MissingBlockVoteRequest23 from Techgo8
trying to fetch MissingBlockVoteRequest23 from Unartistically
requesting block with votes for height 16216601, interval=62366
trying to fetch BlockWithVotesRequest37 from Jrsmsa12169
waiting for message queue to clear from thread [Verifier-mainLoop], size is 4
Exception sending message BlockWithVotesRequest37 to 54.38.53.12:9444: exception: Connection reset; java.net.SocketOutputStream:115, java.net.SocketOutputStream:143, co.nyzo.verifier.Mes>
added new out-of-cycle node to queue: 4b25...1f59
removed node from new out-of-cycle queue due to size
nodejoin_from 34.242.113.244 4b25...1f59 StakeFish10
^^^^^^^^^^^^^^^^^^^^^ casting vote for height 16216601
broadcasting message: BlockVote19 to 2563
...
removed node from new out-of-cycle queue due to size
nodejoin_from 3.99.133.118 21ac...8138 ushakov13
nodejoin_from 18.224.183.125 67a3...c287 cros-nest27
nodejoin_from 18.191.83.115 3950...d3a1 cros-nest43
nodejoin_from 145.239.93.229 0c11...a153 zopsuuu04
added new out-of-cycle node to queue: 8902...10d8
removed node from new out-of-cycle queue due to size
nodejoin_from 5.161.80.171 8902...10d8 wdc-hzvps-13923
nodejoin_from 5.161.74.214 e56c...1d04 wdc-hz-7070
added new out-of-cycle node to queue: e430...3f9a
removed node from new out-of-cycle queue due to size
...

(No more exceptions sending BlockWithVotesRequest37 messages after restart, but no more requests for blocks with votes either.)

Maybe more importantly:
**While this verifier was showing yellow on its detail page, it was displaying as white on the nyzo.co homepage.**
This is not the first time this has happened; I have experienced it myself in the past.
It's just luck that I was able to see it in time.

EggPool (Author) commented Apr 20, 2022

Another verifier.
Lower-spec VPS (but similar symptoms): 2 CPUs, 4 GB RAM.
CPU load at 4 out of 2 while stuck, down to 1.5 once restarted.
No RAM issue, no exceptions in the logs.

IP address: 
last queried: 118.2 seconds ago
nickname: 
version: 605
ID: 
mesh: 9404 total, 2516 in cycle
cycle length: 2534
transactions: 0
retention edge: 16224003
trailing edge: 16224027
frozen edge: 16234167 (Adidas/ff...a1)
open edge: 16234339
blocks transmitted/created: 5/10014
block vote: insufficient count=921; h=16234168; 2d08...e98d
last join height: 16229971
last removal height: 16232117
receiving UDP: yes
- h: +1, n: 3, v: 921(921)
- h: +2, n: 0, v: 0(0)
- h: +3, n: 0, v: 0(0)
...
- h: +170, n: 0, v: 0(0)
- h: +171, n: 0, v: 0(0)
- h: +172, n: 0, v: 0(0)

Logs from tracking to no longer tracking:

requesting block with votes for height 16234166, interval=11204
trying to fetch BlockWithVotesRequest37 from TeslaNow
^^^^^^^^^^^^^^^^^^^^^ casting vote for height 16234166
broadcasting message: BlockVote19 to 2574
storing new vote, height=16234166, hash=e9f5...e4a2
freezing block [Block: v=2, height=16234166, hash=e9f5...e4a2, id=a551...28ab] with standard mechanism
after registering block [Block: v=2, height=16234166, hash=e9f5...e4a2, id=a551...28ab] in BlockchainMetricsManager(), $
BlockVoteManager: removing vote map of size 2502
Top verifier is null. This is likely due to vote scattering, a transition period between voting windows, or recent rest$
cleaning up because block 16234166 was frozen
sending verifier-removal vote to node at IP: 88.99.190.87
maximum cycle transaction amount=∩100000.000000, balance=∩68641875.933043
maximum cycle transaction amount=∩100000.000000, balance=∩68641875.933043
maximum cycle transaction amount=∩100000.000000, balance=∩68641875.933043
maximum cycle transaction amount=∩100000.000000, balance=∩68641875.933043
trying to fetch MissingBlockVoteRequest23 from Digital3
trying to fetch MissingBlockVoteRequest23 from gateway003
trying to fetch MissingBlockVoteRequest23 from Xiahong19925
trying to fetch MissingBlockVoteRequest23 from DLDXJY13
trying to fetch MissingBlockVoteRequest23 from HummerZ
trying to fetch MissingBlockVoteRequest23 from mysql7
trying to fetch MissingBlockVoteRequest23 from inco11069
trying to fetch MissingBlockVoteRequest23 from tomtom10-1172
trying to fetch MissingBlockVoteRequest23 from Yellow452
trying to fetch MissingBlockVoteRequest23 from jolt2
requesting block with votes for height 16234167, interval=10556
trying to fetch BlockWithVotesRequest37 from CPRCP7
waiting for message queue to clear from thread [Verifier-mainLoop], size is 1
added new out-of-cycle node to queue: e433...6804
^^^^^^^^^^^^^^^^^^^^^ casting vote for height 16234167
broadcasting message: BlockVote19 to 2574
added new out-of-cycle node to queue: 741b...2958
storing new vote, height=16234167, hash=ff95...49a1
freezing block [Block: v=2, height=16234167, hash=ff95...49a1, id=c0c7...a093] with standard mechanism
after registering block [Block: v=2, height=16234167, hash=ff95...49a1, id=c0c7...a093] in BlockchainMetricsManager(), $
BlockVoteManager: removing vote map of size 2500
Top verifier is null. This is likely due to vote scattering, a transition period between voting windows, or recent rest$
cleaning up because block 16234167 was frozen
sending verifier-removal vote to node at IP: 54.38.156.209
maximum cycle transaction amount=∩100000.000000, balance=∩68641875.933043
maximum cycle transaction amount=∩100000.000000, balance=∩68641875.933043
maximum cycle transaction amount=∩100000.000000, balance=∩68641875.933043
maximum cycle transaction amount=∩100000.000000, balance=∩68641875.933043
trying to fetch MissingBlockVoteRequest23 from ff
trying to fetch MissingBlockVoteRequest23 from Cool01
trying to fetch MissingBlockVoteRequest23 from barboraQJ3un
trying to fetch MissingBlockVoteRequest23 from tanat
trying to fetch MissingBlockVoteRequest23 from g6
trying to fetch MissingBlockVoteRequest23 from rumba7
trying to fetch MissingBlockVoteRequest23 from ZNMB1
trying to fetch MissingBlockVoteRequest23 from PjL-rac-8576
trying to fetch MissingBlockVoteRequest23 from Lion15902
trying to fetch MissingBlockVoteRequest23 from Amy
requesting block with votes for height 16234168, interval=10838
trying to fetch BlockWithVotesRequest37 from Jonathan
waiting for message queue to clear from thread [Verifier-mainLoop], size is 2
^^^^^^^^^^^^^^^^^^^^^ casting vote for height 16234168
broadcasting message: BlockVote19 to 2574
storing new vote, height=16234168, hash=2d08...e98d
Top verifier is null. This is likely due to vote scattering, a transition period between voting windows, or recent rest$
trying to fetch MissingBlockVoteRequest23 from BabyK2
trying to fetch MissingBlockVoteRequest23 from UNICP30
trying to fetch MissingBlockVoteRequest23 from Time12345
trying to fetch MissingBlockVoteRequest23 from Kimberlee
trying to fetch MissingBlockVoteRequest23 from Safety8
trying to fetch MissingBlockVoteRequest23 from Malooshek48
trying to fetch MissingBlockVoteRequest23 from Techgo12
trying to fetch MissingBlockVoteRequest23 from FLYMOON12
trying to fetch MissingBlockVoteRequest23 from MrNyzo7
trying to fetch MissingBlockVoteRequest23 from binanceinfo
requesting block with votes for height 16234168, interval=9270
trying to fetch BlockWithVotesRequest37 from VR01b
^^^^^^^^^^^^^^^^^^^^^ casting vote for height 16234168
broadcasting message: BlockVote19 to 2574
Top verifier is null. This is likely due to vote scattering, a transition period between voting windows, or recent rest$
trying to fetch MissingBlockVoteRequest23 from NyPunk0
trying to fetch MissingBlockVoteRequest23 from aq2714
trying to fetch MissingBlockVoteRequest23 from HHIAW27
trying to fetch MissingBlockVoteRequest23 from Tesla S
trying to fetch MissingBlockVoteRequest23 from gunray3s5
trying to fetch MissingBlockVoteRequest23 from Princeliest
trying to fetch MissingBlockVoteRequest23 from CITIZEN13
trying to fetch MissingBlockVoteRequest23 from Mitnick
trying to fetch MissingBlockVoteRequest23 from O888a32653
trying to fetch MissingBlockVoteRequest23 from outpost
requesting block with votes for height 16234168, interval=7035
trying to fetch BlockWithVotesRequest37 from leifeng27597
waiting for message queue to clear from thread [Verifier-mainLoop], size is 1
^^^^^^^^^^^^^^^^^^^^^ casting vote for height 16234168
broadcasting message: BlockVote19 to 2574
Top verifier is null. This is likely due to vote scattering, a transition period between voting windows, or recent rest$
trying to fetch MissingBlockVoteRequest23 from Annabelle
trying to fetch MissingBlockVoteRequest23 from BeNoNoe1
trying to fetch MissingBlockVoteRequest23 from MAMBA14
trying to fetch MissingBlockVoteRequest23 from smores17
trying to fetch MissingBlockVoteRequest23 from odessa
trying to fetch MissingBlockVoteRequest23 from Passivation
trying to fetch MissingBlockVoteRequest23 from mm
trying to fetch MissingBlockVoteRequest23 from noNation20
trying to fetch MissingBlockVoteRequest23 from snjor671
trying to fetch MissingBlockVoteRequest23 from CPRCP 17
requesting block with votes for height 16234168, interval=8923
trying to fetch BlockWithVotesRequest37 from Bernadette
^^^^^^^^^^^^^^^^^^^^^ casting vote for height 16234168
broadcasting message: BlockVote19 to 2574
added new out-of-cycle node to queue: 2774...06ea
Top verifier is null. This is likely due to vote scattering, a transition period between voting windows, or recent rest$
broadcasting new-verifier vote for f6f8...9eee at height 16234167
broadcasting message: NewVerifierVote21 to 2574
trying to fetch MissingBlockVoteRequest23 from Unique1
trying to fetch MissingBlockVoteRequest23 from Marry 47
trying to fetch MissingBlockVoteRequest23 from Carter
trying to fetch MissingBlockVoteRequest23 from xiaoi002
trying to fetch MissingBlockVoteRequest23 from James Buchanan
trying to fetch MissingBlockVoteRequest23 from 30Nyain30
trying to fetch MissingBlockVoteRequest23 from Tesla H
trying to fetch MissingBlockVoteRequest23 from Detail POD 17
trying to fetch MissingBlockVoteRequest23 from JS001
trying to fetch MissingBlockVoteRequest23 from Pactine14
requesting block with votes for height 16234168, interval=13726
trying to fetch BlockWithVotesRequest37 from IOI-Again
^^^^^^^^^^^^^^^^^^^^^ casting vote for height 16234168

Then it looped with the same logs until I restarted it.

(This time, it was showing yellow on nyzo.co homepage.)

n-y-z-o (Owner) commented Apr 20, 2022

@EggPool Thank you for this. You've included a lot of really helpful information.

The hardware should not be a concern in either case. All of our verifiers use the $20/month Digital Ocean "basic" droplets (4GB RAM, 2 CPU), and we have not seen any issues with RAM or CPU.

The high CPU load is a really interesting bit of information and will likely help us pinpoint what's happening.

The status responses show that block votes are not coming in at all in the second and third example. The first example was a little different: block votes were coming in, and it looks like it may have caught up on its own after a temporary communication outage. The second example was over 11,000 blocks behind, though, which is almost a full day. And the latest blocks had no votes whatsoever. The third example, while you captured it only 172 blocks behind, was showing yellow, not red. The fact that you could communicate with it, but it was receiving no votes whatsoever for the latest blocks, shows that there was something specific to receiving/processing UDP.

The fact that the restart fixed the problem could point to either an issue with the UDP message queue or that node joins associated with the restart corrected the problem. High CPU usage before the restart suggests that there's a process in a bad state, but it's also possible that the high CPU usage was due to some recovery process working harder than expected.

On example 2 (the one 11,000 blocks behind), the exceptions on MissingBlockRequest25 are due to blacklisting. The blacklisting likely persisted for a little while after restart, but as you noted, the messages no longer needed to be sent.

The white display on the nyzo.co homepage, while the verifier status is yellow, is actually very interesting. I'll look at the logic there to see why that's happening. This should be an easy fix.

Thank you again for all this information. It's very helpful, and we'll keep digging and let you know what we find.

EggPool (Author) commented Apr 20, 2022

> Another verifier, likely the same issue. VPS with 6 cores, 16 GB RAM. CPU load was at 6 out of 6, down to 1.3 after rsync and restart, so the verifier was eating all CPU while stuck. No RAM issue.

This same verifier, right now, was 50 blocks behind. Still tracking and requesting blocks with votes, but late.
No 100% CPU this time (yet).

Yellow status on its detail page, still white on the nyzo.co homepage.

IP address: 
last queried: 57.3 seconds ago
nickname: 
version: 610
ID: 
mesh: 4920 total, 2516 in cycle
cycle length: 2533
transactions: 0
retention edge: 16230868
trailing edge: 16230892
frozen edge: 16241026 (ONEMON2/14...9e)
open edge: 16241050
blocks transmitted/created: 6/11554
block vote: insufficient count=214; lowest-scored; h=16241027; 235e...83a9
last join height: 16240109
last removal height: 16240027
receiving UDP: yes
- h: +1, n: 3, v: 1693(1693)
- h: +2, n: 0, v: 0(0)
- h: +3, n: 0, v: 0(0)
...
- h: +22, n: 0, v: 0(0)
- h: +23, n: 0, v: 0(0)
- h: +24, n: 0, v: 0(0)

Resynced and restarted, but given the startup time, it's still about 30 blocks behind and drifting away. It only moves up one by one; it does not batch catch up as it usually does once it gets to the blocks it saw live.
20-30% CPU use only, plenty of RAM.
This verifier joined quite recently, so it will likely drop soon.

After restart and a while:

IP address: 
last queried: 74.8 seconds ago
nickname: 
version: 610
ID: 
mesh: 4920 total, 2514 in cycle
cycle length: 2533
transactions: 0
retention edge: 16230980
trailing edge: 16231004
frozen edge: 16241138 (MagicQ/77...72)
open edge: 16241169
blocks transmitted/created: 1/34
block vote: insufficient count=10; h=16241139; undetermined
last join height: 16240109
last removal height: 16240027
receiving UDP: yes
- h: +1, n: 2, v: 1050(1050)
- h: +2, n: 0, v: 0(0)
- h: +3, n: 0, v: 0(0)
...
- h: +29, n: 0, v: 0(0)
- h: +30, n: 0, v: 0(0)
- h: +31, n: 0, v: 0(0)

...
Now stuck again

added new out-of-cycle node to queue: 247e...f6ef
nodejoin_from 51.161.48.254 247e...f6ef Super5282
^^^^^^^^^^^^^^^^^^^^^ casting vote for height 16241171
broadcasting message: BlockVote19 to 2561
added new out-of-cycle node to queue: e397...cf42
nodejoin_from 138.68.189.3 e397...cf42 xyz
nodejoin_from 157.245.98.186 79cb...3d37 Mr
nodejoin_from 161.35.97.208 68cf...1a91 barrick
nodejoin_from 51.161.59.248 0b66...2474 pvip024
nodejoin_from 167.71.85.49 da0b...90e7 Disgusted1GqPM
Top verifier is null. This is likely due to vote scattering, a transition period between voting windows, or recent restart of this verifier. If this continues for more than 50 blocks, please check your verifier configuration.
trying to fetch MissingBlockVoteRequest23 from VR05b
trying to fetch MissingBlockVoteRequest23 from snjor615
trying to fetch MissingBlockVoteRequest23 from zksoft22
trying to fetch MissingBlockVoteRequest23 from minerking5vc3MBa8
trying to fetch MissingBlockVoteRequest23 from qq
trying to fetch MissingBlockVoteRequest23 from So COMIX0
trying to fetch MissingBlockVoteRequest23 from europe2
trying to fetch MissingBlockVoteRequest23 from slik4
trying to fetch MissingBlockVoteRequest23 from *lufe3
trying to fetch MissingBlockVoteRequest23 from Topsuuuuu04
requesting block with votes for height 16241171, interval=7207
trying to fetch BlockWithVotesRequest37 from UPPPU14
^^^^^^^^^^^^^^^^^^^^^ casting vote for height 16241171
broadcasting message: BlockVote19 to 2561
added new out-of-cycle node to queue: 9a82...ac10
nodejoin_from 51.161.48.246 9a82...ac10 Super8761
nodejoin_from 51.38.132.116 235f...3796 Toyota
added new out-of-cycle node to queue: 14fe...0b7a
nodejoin_from 3.234.241.174 14fe...0b7a 0xFury20
Top verifier is null. This is likely due to vote scattering, a transition period between voting windows, or recent restart of this verifier. If this continues for more than 50 blocks, please check your verifier configuration.
trying to fetch MissingBlockVoteRequest23 from mrle162
trying to fetch MissingBlockVoteRequest23 from D_e_L_i04
trying to fetch MissingBlockVoteRequest23 from Bearcat3
trying to fetch MissingBlockVoteRequest23 from aes2019
trying to fetch MissingBlockVoteRequest23 from MAMBA1
trying to fetch MissingBlockVoteRequest23 from Websites_Dislike
trying to fetch MissingBlockVoteRequest23 from hcbhb
trying to fetch MissingBlockVoteRequest23 from OneNight
trying to fetch MissingBlockVoteRequest23 from P-Rich15
trying to fetch MissingBlockVoteRequest23 from Burgess
requesting block with votes for height 16241171, interval=7489
trying to fetch BlockWithVotesRequest37 from mrle310
added new out-of-cycle node to NodeManager: 37c6...1dbe
added new out-of-cycle node to NodeManager: a571...beec
^^^^^^^^^^^^^^^^^^^^^ casting vote for height 16241171
broadcasting message: BlockVote19 to 2561
added new out-of-cycle node to queue: a47a...841a
nodejoin_from 54.216.161.130 a47a...841a StakeFish5
nodejoin_from 66.70.190.52 df39...63c6 FLYMOON12

etc.

The usual cure in this case is running a backup verifier for the same ID, since in previous similar cases switching the verifier to another IP only delayed the issue: the new verifier also got stuck in the same way.

n-y-z-o (Owner) commented Apr 20, 2022

One note: to maintain score, just set a backup vote source identifier to any verifier that isn't having trouble tracking, yours or another: http://tech.nyzo.co/setupInstructions/preferences#:~:text=fallback_vote_source_identifier
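
For reference, a minimal sketch of that preference, assuming the standard preferences file location (/var/lib/nyzo/production/preferences); the value shown is a placeholder, not a real identifier:

fallback_vote_source_identifier=<identifier of a verifier that is tracking well>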

EggPool (Author) commented Apr 20, 2022

> One note: to maintain score, just set a backup vote source identifier to any verifier that isn't having trouble tracking, yours or another: http://tech.nyzo.co/setupInstructions/preferences#:~:text=fallback_vote_source_identifier

Yep, it has (like all my verifiers), but its score is rising anyway.

n-y-z-o (Owner) commented Apr 20, 2022

> Yep, it has (like all my verifiers), but its score is rising anyway.

This would make sense if it cannot receive any votes at all, because it does not receive the vote that it needs to use the fallback.

This is the real problem, and solving this problem would solve everything:

- h: +29, n: 0, v: 0(0)
- h: +30, n: 0, v: 0(0)
- h: +31, n: 0, v: 0(0)

For some reason, your verifier is simply not receiving votes.

I'm looking at scores now on Argo 746 and looking at the information the nyzo.co server has to see if there's any indication why your verifier isn't getting votes. The nyzo.co server doesn't have any special access, but it does do a lot of queries.

n-y-z-o (Owner) commented Apr 20, 2022

And the specific problem is this: v: 0(0). The n: 0 part isn't a concern, because the block is kept in the disconnected-blocks map, which is not currently displayed in the status.

n-y-z-o (Owner) commented Apr 20, 2022

I'm pretty sure I found the verifier you described. It was running version 610, joined the cycle not too long ago, is having tracking problems, and was displaying as white on the nyzo.co cycle page.

I added some extra logging to see why it was displaying as white on the cycle page, and it displayed as yellow on the cycle page after the restart. The logic on the web server for yellow is:

return verifierUnhealthyIdentifiers.contains(identifier) &&
        !verifierStatusHealthyIpToIdentifierMap.values().contains(identifier);

This logic is meant to display the health of the more healthy of multiple verifiers when multiple verifiers are present for a particular identifier on multiple IP addresses. So, it appears that the nyzo.co server had some invisible entry that was not displaying on the status page but was convincing it that the verifier was healthy. I'd really like to understand this, but I wasn't able to find the particular cause in this case. I'll add additional monitoring in the next few days to surface this discrepancy next time it happens.

For this particular verifier, I currently see a score of about 25,000 to 30,000 across the verifiers I've checked. The removal threshold is 148,116. So, you still have a little more than 1.5 days of non-voting before removal. I recommend keeping the verifier on the current IP address and restarting it one or two more times.

I'm next going to add some logging to the Argos and Killr to see if they are sending votes to your verifier. If they are sending votes, then this is likely a UDP message queue issue. If they are not, then this is likely a node-join issue.

n-y-z-o (Owner) commented Apr 21, 2022

I've been watching your verifier for the past several hours, and it has gone from consistent 0 votes at many block levels to, right now, receiving around 2500 votes per block level. With the current cycle length, the threshold is around 1,900 votes to freeze a block.

I've also seen many heights where the block votes are somewhere in between: 600, 1000, or 1500 votes.

The best I can figure is that a lot of incoming UDP is getting dropped, either deliberately as misguided protection, or inadvertently, by the hosting provider.

If you want to try to save this one, I suggest moving it to a totally different hosting provider for a little while to get its score back. If the information I have is correct, it appears this one is currently on Contabo. If you have a Digital Ocean account, I highly recommend them for network stability, and you could get one of their $20/month Droplets for just a little while and destroy it when the verifier is no longer in danger.

Another solution that would be really nice would be to begin storing block votes on the client and configuring the client to respond to block-with-votes requests. We also need to see whether the chain filler can/should be stepping in to freeze entire chain sections. Right now, this verifier is receiving enough votes to freeze several blocks around the cycle's frozen edge, and I think that should be sufficient for the chain filler to do its job, but I'd need to review the code again to confirm.

EggPool (Author) commented Apr 21, 2022

Thanks for digging.
Will add some thoughts later on.

About the yellow-on-homepage and invisible-entry issue: this verifier was on another IP when in-queue and was migrated shortly after it joined. (Many other verifiers do this, so this is likely a larger issue than mine alone; reborn experienced it as well with a migrated verifier.)

At that time I set up the new verifier, it went white, and I stopped the old one.
The old one went red on the nyzo.co status page and then disappeared after a while.

Just restarted the verifier; it's keeping track for now.

EggPool (Author) commented Apr 21, 2022

Update: the IP the current verifier is using was previously used by a backup in-cycle verifier.
That could be another reason for a long-lasting erroneous link somewhere, if it was related to the other (white) in-cycle verifier.

EggPool (Author) commented Apr 21, 2022

The second verifier (v605) was stuck again, with a high CPU load.

IP address: 
last queried: 77.1 seconds ago
nickname: 
version: 605
ID: 
mesh: 9445 total, 2509 in cycle
cycle length: 2533
transactions: 0
retention edge: 16235461
trailing edge: 16235485
frozen edge: 16245617 (Detachment/08...58)
open edge: 16246787
blocks transmitted/created: 0/2
block vote: insufficient count=860; h=16245618; dc15...30f8
last join height: 16245177
last removal height: 16243448
receiving UDP: yes
- h: +1, n: 1, v: 860(860)
- h: +2, n: 0, v: 0(0)
- h: +3, n: 0, v: 0(0)
...
- h: +1166, n: 0, v: 2458(2458)
- h: +1167, n: 0, v: 2165(2165)
- h: +1168, n: 0, v: 0(0)

I just added this to the preferences, hoping to reduce CPU load when stuck and recover faster:

maximum_concurrent_connections_per_ip=3
concurrent_connection_throttle_threshold=200
maximumConcurrentConnections=400

After rsync and restart, it moves up, but drifts for lack of votes.

IP address: 
last queried: 3.2 seconds ago
nickname: 
version: 605
ID: 
mesh: 9443 total, 2510 in cycle
cycle length: 2532
transactions: 0
retention edge: 16236763
trailing edge: 16236787
frozen edge: 16246917 (Winwin25742/1c...da)
open edge: 16246942
blocks transmitted/created: 1/77
block vote: insufficient count=0; h=16246918; undetermined
last join height: 16245177
last removal height: 16243448
receiving UDP: yes
- h: +1, n: 3, v: 854(854)
- h: +2, n: 0, v: 0(0)
- h: +3, n: 0, v: 0(0)
...
- h: +23, n: 0, v: 0(0)
- h: +24, n: 0, v: 0(0)
- h: +25, n: 0, v: 0(0)

Please note that here I take the time to collect data and report every incident over a short time frame, but this is only a part of what has been happening at a larger scale for months: several users, small and larger operators alike, fighting day after day to keep such verifiers in-cycle, or just letting them go if they don't notice soon enough, or simply giving up.
This is not new, not isolated, and not just me.

reb0rn21 commented

Yeah, I have seen this issue, but for me it was OOM on a 1 GB VPS with a great CPU and 1 GB of swap. I moved them to another VPS; the low RAM should not be an issue, so maybe the VPS had some memory problem I can not see.

CPU load was the same, stuck at 100% for one core. The main nyzo page also showed the verifier as white; the only detection came from the sentinel (still white, but all zeros all the time).

EggPool (Author) commented Apr 22, 2022

One of the verifiers from above and several different ones are stuck.
Two different providers are involved.

Just pasting some info from one of the new ones.

IP address: 
last queried: 58.8 seconds ago
nickname: 
version: 611002
ID: 
mesh: 4429 total, 2502 in cycle
cycle length: 2530
transactions: 0
retention edge: 16245072
trailing edge: 16245096
frozen edge: 16255220 (heisenberg0/d2...6f)
open edge: 16258570
blocks transmitted/created: 167/425425
block vote: insufficient count=1890; h=16255221; 2899...f1e5
last join height: 16250243
last removal height: 16254578
receiving UDP: yes
- h: +1, n: 3, v: 1890(1890)
- h: +2, n: 0, v: 1731(1731)
- h: +3, n: 0, v: 1772(1772)
...
- h: +3347, n: 0, v: 944(944)
- h: +3348, n: 0, v: 1243(1243)
- h: +3349, n: 0, v: 990(990)

VPS with 4 CPUs, 8 GB RAM.

Logs:

removed node from new out-of-cycle queue due to size
nodejoin_from 54.220.14.207 b160...84df Disperze29
nodejoin_from 142.44.145.66 5962...ac7f automotives34
^^^^^^^^^^^^^^^^^^^^^ casting vote for height 16255222
broadcasting message: BlockVote19 to 2549
nodejoin_from 51.161.28.73 8682...01d5 burr174
added new out-of-cycle node to queue: 0627...8253
removed node from new out-of-cycle queue due to size
nodejoin_from 198.50.165.82 0627...8253 cryptobull1083
nodejoin_from 95.217.216.76 7f79...6a1c Gogirl A5
added new out-of-cycle node to queue: 194d...f127
nodejoin_from 63.32.99.98 194d...f127 Mercury33
added new out-of-cycle node to queue: 94e1...2bba
removed node from new out-of-cycle queue due to size
nodejoin_from 104.200.149.21 94e1...2bba roor
trying to fetch BlockRequest11 from Houston
Top verifier is null. This is likely due to vote scattering, a transition period between voting windows, or recent restart of this verifier. If this continues for more than 50 blocks, please check your verifier configuration.
broadcasting new-verifier vote for 814c...e019 at height 16255221
broadcasting message: NewVerifierVote21 to 2549
added new out-of-cycle node to queue: 24ea...f49c
removed node from new out-of-cycle queue due to size
nodejoin_from 18.119.150.92 24ea...f49c BartestneT16
nodejoin_from 134.122.24.210 a132...726e mototoya003
nodejoin_from 18.202.213.14 69f7...efb3 KalpaTech21
added new out-of-cycle node to queue: d6e6...23c4
removed node from new out-of-cycle queue due to size
nodejoin_from 51.161.48.15 d6e6...23c4 Angry4077
nodejoin_from 46.101.89.80 a508...49ca open
added new out-of-cycle node to queue: 6ca2...cc1b
removed node from new out-of-cycle queue due to size
nodejoin_from 51.89.88.154 6ca2...cc1b okay2216
nodejoin_from 54.154.138.196 3335...d9a7 KalpaTech6
nodejoin_from 52.214.143.100 4f97...a8f3 StakeFish38
added new out-of-cycle node to queue: 6038...810e
removed node from new out-of-cycle queue due to size
nodejoin_from 192.99.132.222 6038...810e KosherEYWu0
added new out-of-cycle node to queue: ab01...cad3
removed node from new out-of-cycle queue due to size
nodejoin_from 46.161.12.42 ab01...cad3 cookie12.42
nodejoin_from 54.228.69.47 073f...b81b B-Harvest27
added new out-of-cycle node to queue: 7768...cd08
removed node from new out-of-cycle queue due to size
nodejoin_from 51.75.81.46 7768...cd08 mototoya
trying to fetch MissingBlockVoteRequest23 from VotingMe
trying to fetch MissingBlockVoteRequest23 from ff
trying to fetch MissingBlockVoteRequest23 from Mitnick
trying to fetch MissingBlockVoteRequest23 from CITIZEN12
trying to fetch MissingBlockVoteRequest23 from mrle144
trying to fetch MissingBlockVoteRequest23 from VUL1RT
trying to fetch MissingBlockVoteRequest23 from crypto071
trying to fetch MissingBlockVoteRequest23 from xOlife6342
trying to fetch MissingBlockVoteRequest23 from Nget12089
trying to fetch MissingBlockVoteRequest23 from 10Dollar16
requesting block with votes for height 16255222, interval=38405
trying to fetch BlockWithVotesRequest37 from TMac
^^^^^^^^^^^^^^^^^^^^^ casting vote for height 16255222
broadcasting message: BlockVote19 to 2549
nodejoin_from 51.38.131.249 f09d...bd5e tedcrypto16
added new out-of-cycle node to queue: 5ee0...c265
removed node from new out-of-cycle queue due to size
nodejoin_from 52.49.83.107 5ee0...c265 Cypher Core25
added new out-of-cycle node to queue: c5fb...27b6
nodejoin_from 51.161.60.189 c5fb...27b6 pvip221
added new out-of-cycle node to queue: b84a...69c4
removed node from new out-of-cycle queue due to size
nodejoin_from 51.161.40.193 b84a...69c4 Super25180
added new out-of-cycle node to queue: f58e...4a6a
removed node from new out-of-cycle queue due to size
nodejoin_from 51.75.81.63 f58e...4a6a mototoya
nodejoin_from 52.86.88.153 f67f...2e35 Alphabet21
nodejoin_from 206.246.104.70 b25f...9b48 cj
added new out-of-cycle node to queue: 9ea0...153e
removed node from new out-of-cycle queue due to size
nodejoin_from 51.89.49.15 9ea0...153e boogie168648
added new out-of-cycle node to queue: 895d...0016
removed node from new out-of-cycle queue due to size
nodejoin_from 51.161.40.210 895d...0016 Super49616
nodejoin_from 95.217.19.165 1665...e899 X1986X14
nodejoin_from 157.245.11.191 8782...0a28 gateway011
added new out-of-cycle node to queue: e2df...c94a
removed node from new out-of-cycle queue due to size
nodejoin_from 51.161.60.205 e2df...c94a pvip237
added new out-of-cycle node to queue: d05c...873a
removed node from new out-of-cycle queue due to size
nodejoin_from 139.99.168.166 d05c...873a greglem
nodejoin_from 207.148.31.149 a8ca...d9ca Randolph Scott 3
trying to fetch BlockRequest11 from mrle377
Top verifier is null. This is likely due to vote scattering, a transition period between voting windows, or recent restart of this verifier. If this continues for more than 50 blocks, please check your verifier configuration.
trying to fetch MissingBlockVoteRequest23 from Masoud 10
trying to fetch MissingBlockVoteRequest23 from Fushimi Inari
trying to fetch MissingBlockVoteRequest23 from tomtom05-1126
trying to fetch MissingBlockVoteRequest23 from D-Email15
trying to fetch MissingBlockVoteRequest23 from Telegram20
trying to fetch MissingBlockVoteRequest23 from Stone
trying to fetch MissingBlockVoteRequest23 from noway36
trying to fetch MissingBlockVoteRequest23 from heman15
trying to fetch MissingBlockVoteRequest23 from KatBusch9
trying to fetch MissingBlockVoteRequest23 from Baruch
requesting block with votes for height 16255222, interval=24000
trying to fetch BlockWithVotesRequest37 from NV1080
^^^^^^^^^^^^^^^^^^^^^ casting vote for height 16255222
broadcasting message: BlockVote19 to 2549
Exception sending message MissingBlockVoteRequest23 to 157.230.8.104:9444: exception: Connection reset; java.net.SocketOutputStream:115, java.net.SocketOutputStream:143, co.nyzo.verifier.Message$1:231, java.lang.Thread:748
nodejoin_from 91.121.216.89 8b6d...bd41 ade4..80cc
nodejoin_from 164.132.191.204 34b2...fd1b Mitnick
nodejoin_from 18.237.59.96 40de...3fc6 cros-nest32
nodejoin_from 141.94.21.251 6c31...076d Launch9
added new out-of-cycle node to queue: 2999...244a
removed node from new out-of-cycle queue due to size
nodejoin_from 3.16.164.140 2999...244a PigCash0
nodejoin_from 67.205.156.246 e4b9...fbe7 tech
added new out-of-cycle node to queue: 70f2...c599
removed node from new out-of-cycle queue due to size
nodejoin_from 51.161.121.213 70f2...c599 tomtom1214
added new out-of-cycle node to queue: ea62...1aea
removed node from new out-of-cycle queue due to size
nodejoin_from 34.239.151.108 ea62...1aea 0xFury46
nodejoin_from 15.222.69.110 3055...1e8b Provalidator5
added new out-of-cycle node to queue: 5846...b239
nodejoin_from 46.161.12.154 5846...b239 crypto039
trying to fetch BlockRequest11 from idHLyes17
Top verifier is null. This is likely due to vote scattering, a transition period between voting windows, or recent restart of this verifier. If this continues for more than 50 blocks, please check your verifier configuration.
trying to fetch MissingBlockVoteRequest23 from Xmoon01
trying to fetch MissingBlockVoteRequest23 from incent5uk
trying to fetch MissingBlockVoteRequest23 from Launch9
trying to fetch MissingBlockVoteRequest23 from PeterZoller
trying to fetch MissingBlockVoteRequest23 from HypeGO18
trying to fetch MissingBlockVoteRequest23 from Jabba013
trying to fetch MissingBlockVoteRequest23 from uc
trying to fetch MissingBlockVoteRequest23 from EXPRESS24
trying to fetch MissingBlockVoteRequest23 from DimiO73
trying to fetch MissingBlockVoteRequest23 from Guiacum
requesting block with votes for height 16255222, interval=17526
trying to fetch BlockWithVotesRequest37 from Hulama15
waiting for message queue to clear from thread [Verifier-mainLoop], size is 1
^^^^^^^^^^^^^^^^^^^^^ casting vote for height 16255222
broadcasting message: BlockVote19 to 2549
nodejoin_from 195.201.42.31 9ba2...7c6b NewChapter
added new out-of-cycle node to queue: 164e...41e1
removed node from new out-of-cycle queue due to size
nodejoin_from 198.50.165.66 164e...41e1 cryptobull1067
nodejoin_from 5.135.192.124 697e...54de Nothing
added new out-of-cycle node to queue: a735...2b7e
removed node from new out-of-cycle queue due to size
nodejoin_from 51.161.72.18 a735...2b7e automotives18
added new out-of-cycle node to queue: 9cb5...b310
removed node from new out-of-cycle queue due to size
nodejoin_from 51.161.121.230 9cb5...b310 tomtom1231
nodejoin_from 95.217.17.188 fa12...900f MAMBA17
added new out-of-cycle node to queue: 76c9...94d1
nodejoin_from 198.50.187.185 76c9...94d1 Nutka302
added new out-of-cycle node to queue: 4737...e251
removed node from new out-of-cycle queue due to size
nodejoin_from 142.44.145.67 4737...e251 automotives35

Before stalling on this block, it had been stalling on many previous ones.
No exception apart from the message-sending one.
No significant CPU load here.

@EggPool
Author

EggPool commented Apr 22, 2022

Side note:
I have not moved the verifier you were watching more closely yet.
It's white and tracking, at least most of the time I check. It has not gotten stuck since.
It has a fallback vote source.

Yet, its performance score continues to slowly rise.

@EggPool
Author

EggPool commented Apr 22, 2022

Update on the 2 CPU, 4 GB one.
Stuck again (not Contabo here).

IP address: 
last queried: 75.2 seconds ago
nickname: 
version: 605
ID: 
mesh: 9460 total, 2503 in cycle
cycle length: 2524
transactions: 0
retention edge: 16253441
trailing edge: 16253465
frozen edge: 16263583 (az/50...c2)
open edge: 16263628
blocks transmitted/created: 3/4858
block vote: leading; h=16263584; 29d9...257e
last join height: 16260372
last removal height: 16263446
receiving UDP: yes
- h: +1, n: 2, v: 2468(2468)
- h: +2, n: 0, v: 0(0)
- h: +3, n: 0, v: 0(0)
...
- h: +34, n: 0, v: 0(0)
- h: +35, n: 0, v: 0(0)
- h: +36, n: 0, v: 0(0)
trying to fetch BlockWithVotesRequest37 from Imcyclein8
trying to fetch BlockRequest11 from Indue7
^^^^^^^^^^^^^^^^^^^^^ casting vote for height 16263594
broadcasting message: BlockVote19 to 2553
added new out-of-cycle node to queue: 5c1e...7186
trying to fetch BlockRequest11 from noway74
trying to fetch MissingBlockVoteRequest23 from Enygo5
trying to fetch MissingBlockVoteRequest23 from Xmoon19767
trying to fetch MissingBlockVoteRequest23 from Marconi18
trying to fetch MissingBlockVoteRequest23 from bull0009
trying to fetch MissingBlockVoteRequest23 from avalon689597
trying to fetch MissingBlockVoteRequest23 from BabyK6
trying to fetch MissingBlockVoteRequest23 from EHBTC19
trying to fetch MissingBlockVoteRequest23 from NowFuture6
trying to fetch MissingBlockVoteRequest23 from Sam_FTX 25
trying to fetch MissingBlockVoteRequest23 from DANCED21
requesting block with votes for height 16263594, interval=6818
trying to fetch BlockWithVotesRequest37 from ttt15
top verifier eb51...55e6 has 2351 votes with a cycle length of 2524 (93.1%)
^^^^^^^^^^^^^^^^^^^^^ casting vote for height 16263594
broadcasting message: BlockVote19 to 2553
trying to fetch BlockRequest11 from burrr02
trying to fetch MissingBlockVoteRequest23 from Tiesto06
trying to fetch MissingBlockVoteRequest23 from Input11
trying to fetch MissingBlockVoteRequest23 from Mask Jame 3
trying to fetch MissingBlockVoteRequest23 from covn3
trying to fetch MissingBlockVoteRequest23 from PACITES0
trying to fetch MissingBlockVoteRequest23 from HOTtk 5
trying to fetch MissingBlockVoteRequest23 from Dida11
trying to fetch MissingBlockVoteRequest23 from Marconi20
trying to fetch MissingBlockVoteRequest23 from Worth18343
trying to fetch MissingBlockVoteRequest23 from mototoya007
requesting block with votes for height 16263594, interval=5307
trying to fetch BlockWithVotesRequest37 from hYpOnYzO_017
^^^^^^^^^^^^^^^^^^^^^ casting vote for height 16263594
broadcasting message: BlockVote19 to 2553
trying to fetch BlockRequest11 from 🥝
top verifier eb51...55e6 has 2350 votes with a cycle length of 2524 (93.1%)
trying to fetch MissingBlockVoteRequest23 from PRADA9
trying to fetch MissingBlockVoteRequest23 from Hummer Kotu
trying to fetch MissingBlockVoteRequest23 from D-Email4
trying to fetch MissingBlockVoteRequest23 from pvip107
trying to fetch MissingBlockVoteRequest23 from 047d...5009
trying to fetch MissingBlockVoteRequest23 from БалканскаADSITz
trying to fetch MissingBlockVoteRequest23 from Freyja
trying to fetch MissingBlockVoteRequest23 from barboraQJ3un
trying to fetch MissingBlockVoteRequest23 from HertoHim 17
trying to fetch MissingBlockVoteRequest23 from Villa4
requesting block with votes for height 16263594, interval=4027
trying to fetch BlockWithVotesRequest37 from Hummer Kotu

Rsync'd, restarted, tracking again.

@n-y-z-o
Owner

n-y-z-o commented Apr 22, 2022

I can't speak to all of these issues, and I don't know the details of the configurations, but I have been watching the one verifier closely, and I can tell you what I've seen there.

When the verifier was not receiving any block votes, I added additional logging to Killr, Argo 746, and Argo 752. They were all sending block votes to the other verifier for every height, and they were sending to the correct IP address. So this wasn't some node-join connectedness issue related to a changed IP. The votes were being sent, and Killr, Argo 746, and Argo 752 were having no problems communicating with one another and with the rest of the cycle. But the votes were not being received on the verifier that was having trouble tracking.

Yesterday, I was also seeing the score continue to creep up, despite the verifier tracking well. I did, however, see that the score is slightly lower (better) this morning. The current score I'm seeing is 54462 on Killr, 67179 on Argo 746, and 53339 on Argo 752. So, it appears that outbound UDP issues persisted for at least 24 hours, and I'm unsure if they are resolved yet.

Killr is in Digital Ocean's AMS3 (Amsterdam) data center. Argo 746 is in Digital Ocean's SGP1 (Singapore) data center. Argo 752 is in Digital Ocean's NYC3 (New York) data center. They are all running on Digital Ocean's "basic" shared CPU plan with a "regular" (not premium) CPU, 4GB RAM, 80GB of SSD storage, 2 CPUs, and 4TB of transfer. In the United States, this is $0.03/hour or $20/month.

The root of the problem on that verifier is an issue with UDP communication, both inbound and outbound. But these are not problems that we can fix -- they appear to be issues outside the Java application. If a verifier does not have reliable communication with the rest of the cycle, it will not be able to participate effectively in consensus, and it cannot be allowed to stay in the cycle over the long term. This is why we have performance scores, and removals due to performance scores have been essential to keeping the cycle healthy.
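
Not part of the verifier itself, but for anyone who wants to rule the provider in or out: a tiny standalone listener like the sketch below can confirm whether UDP datagrams reach the host at all. The port 9446 is assumed here to be the verifier's UDP port; stop the verifier first, or pick an unused port and send test datagrams to it from another machine, so the bind succeeds.

import java.net.DatagramPacket;
import java.net.DatagramSocket;

// Minimal UDP reachability probe: bind a port and print whatever arrives.
// Run with the verifier stopped, or pass a different, unused port as the first argument.
public class UdpProbe {
    public static void main(String[] args) throws Exception {
        int port = args.length > 0 ? Integer.parseInt(args[0]) : 9446;  // assumed verifier UDP port
        try (DatagramSocket socket = new DatagramSocket(port)) {
            byte[] buffer = new byte[4096];
            System.out.println("listening for UDP on port " + port);
            while (true) {
                DatagramPacket packet = new DatagramPacket(buffer, buffer.length);
                socket.receive(packet);
                System.out.println("received " + packet.getLength() + " bytes from " +
                        packet.getAddress().getHostAddress());
            }
        }
    }
}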

We can work on some of the issues with resynchronization. The two biggest will likely be a watchdog for all important threads and ensuring that the mechanism of freezing chain sections always works when two blocks can be frozen at the end of the chain. I'm thinking the chain-section mechanism may only be active at startup now, but I can't think of a reason, right now, that it couldn't be active all the time.
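
As a rough illustration of the watchdog idea (the class, names, and timeout below are invented for this sketch, not the verifier's actual code): each important thread records a heartbeat on every loop iteration, and a scheduled task flags any thread that has stopped checking in, at which point a real implementation could restart it or trigger resynchronization.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch of a watchdog for important threads. Threads call heartbeat() in their loops;
// a scheduled task reports any thread that has not checked in within the timeout.
public class ThreadWatchdog {

    private static final long timeoutMilliseconds = 60_000L;  // invented value for the sketch
    private static final Map<String, Long> lastHeartbeat = new ConcurrentHashMap<>();

    public static void heartbeat(String threadName) {
        lastHeartbeat.put(threadName, System.currentTimeMillis());
    }

    public static void start() {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> {
            long now = System.currentTimeMillis();
            lastHeartbeat.forEach((name, timestamp) -> {
                if (now - timestamp > timeoutMilliseconds) {
                    System.out.println("thread " + name + " appears stalled; last heartbeat " +
                            (now - timestamp) + " ms ago");
                    // a real implementation would restart the thread or force a resync here
                }
            });
        }, 10L, 10L, TimeUnit.SECONDS);
    }

    public static void main(String[] args) {
        start();
        heartbeat("Verifier-mainLoop");  // example: the main loop would call this each iteration
    }
}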

We can also start new cycle joins with the maximum negative score instead of a score of zero. This will give them more room for working out issues soon after they join. One of the most disappointing things I see is verifiers that drop shortly after they join.
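
Sketched out (the constant, map, and method names below are illustrative only, not the actual fields), the change would simply seed the score at the negative floor when a verifier joins the cycle:

import java.nio.ByteBuffer;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative only: give a newly joined verifier the maximum negative performance score,
// so it has headroom for early issues, instead of starting it at zero.
public class JoinScoreSketch {

    private static final long maximumNegativeScore = -100_000L;  // assumed floor, not the real constant
    private static final Map<ByteBuffer, Long> performanceScores = new ConcurrentHashMap<>();

    public static void registerNewCycleVerifier(ByteBuffer identifier) {
        performanceScores.put(identifier, maximumNegativeScore);  // previously this would have been 0L
    }
}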

Based on what I've seen the past few days, Contabo and OVH seem to have the most problems with UDP communication. I think it would be wise to avoid these providers for verifiers.

@EggPool I know that you have verifiers on multiple providers, but I would like you to try Digital Ocean specifically, if that's an option, to see what it looks like when a verifier has truly stable UDP communication. UDP, by its very design, is not guaranteed. As we see from the differences in performance scores on Argo 746, reliability of UDP will vary based on geographical location. But Argo 746 is also a really good example of how stable a verifier can be when it is geographically distant from much of the rest of the cycle. We have had Argo 746 in Singapore for more than 2 years now, and it has never had a problem with performance scores.

All of our setups on Digital Ocean follow the instructions we posted on tech.nyzo.co: https://tech.nyzo.co/setupInstructions/verifier. We always use the latest Ubuntu LTS version available when we build a verifier.

Our preferences files all contain a single option (block_file_consolidator=delete). This option is now irrelevant, due to the changes in version 622: https://tech.nyzo.co/releaseNotes/v622. So, if you want your verifier configured exactly like ours, you don't need a preferences file at all.
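
For reference, such a preferences file is just a single line; the path assumed here is the data directory used by the standard setup instructions (/var/lib/nyzo/production/preferences):

block_file_consolidator=delete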

I'd really like to see how this works out for you. We spend almost no time at all maintaining our verifiers. When we restart them, it's either to test new code or to update to a newly released version. Of our 10 verifiers, 4 are still running version 600. Each of those has transmitted more than 2900 blocks and produced more than 7.64 million blocks since its last restart. In the right environment, the Nyzo verifier is ridiculously stable, and it has been for quite some time.

If you can try out a verifier on Digital Ocean, I'd really like to see how it works for you. I know you and a lot of other people are frustrated right now, but it really doesn't have to be this way.

Of course, we don't want all of the Nyzo cycle to move to Digital Ocean. This is just one provider that we know provides consistent UDP ingress and egress. We had similarly good experiences on AWS, but bandwidth was too expensive. I'm sure other good providers could be identified by looking at verifiers with long uptimes in the cycle.

@EggPool
Author

EggPool commented Apr 22, 2022

Thanks.
Tbh, I don't consider myself frustrated.

What I'm aiming for here is to give you some real-time feedback on what it's like to operate a verifier as a regular user: having to monitor and restart every now and then, seeing it go yellow without showing as such on the home page, watching it drop for bad performance despite everything you can do (resync, restart, fallback votes, migration, backups), and hearing in return that "the system works as intended".
I'm trying to bridge, from the outside, part of what I've seen for months from many users on the Discord with the much more limited feedback you can get from your own verifiers and cycle analysis.
I'm not doing this because I feel frustrated, but so that problems can be precisely identified, worked on, and solved.
If everything worked as intended, there would be nothing to fix and nothing to explain what most users have experienced.

As for D.O.:
Not everyone can run $20/month verifiers. Many use lower-end offers, more like $4-$5 IIRC, that worked well in the past and work well for other similar tasks.
I do agree that moving this specific verifier to DO would more than likely solve its current issue. I may try it in a few days.
I also have verifiers that require zero maintenance, by the way.
However, I don't think it's a solution. To me, this is a warning sign of deeper issues, and moving would only hide important future trouble.
Running many verifiers on different providers is a way for me to see more issues than I would with just an ideal setup.

Last for now: even if we conclude that some specific providers drop UDP traffic from time to time (not all the time, and there could be a good reason why), knowing that this happens and gets the verifier stuck can lead to better warnings, backup/degraded modes, and recovery mechanisms. Those could avoid most of the drops for "regular" users, or at least show them clearly what is going on.
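
One possible shape for that kind of warning, purely as a sketch with invented names and thresholds (not existing Nyzo code): track how long the frozen edge has been unchanged and surface one obvious alert, instead of leaving the operator to infer the stall from the low-level fetch/retry lines.

// Sketch of a stall alert: if the frozen edge has not advanced for a while,
// print a prominent warning so the operator sees the problem immediately.
public class StallAlert {

    private static final long stallThresholdMilliseconds = 5L * 60L * 1000L;  // invented: 5 minutes
    private static long lastFrozenHeight = -1L;
    private static long lastAdvanceTimestamp = System.currentTimeMillis();

    // called wherever the frozen edge is updated or re-checked
    public static synchronized void frozenEdgeChecked(long frozenHeight) {
        if (frozenHeight != lastFrozenHeight) {
            lastFrozenHeight = frozenHeight;
            lastAdvanceTimestamp = System.currentTimeMillis();
        } else if (System.currentTimeMillis() - lastAdvanceTimestamp > stallThresholdMilliseconds) {
            System.out.println("*** frozen edge stuck at height " + frozenHeight + " for more than " +
                    (stallThresholdMilliseconds / 60000L) + " minutes; check UDP connectivity ***");
        }
    }

    public static void main(String[] args) throws InterruptedException {
        frozenEdgeChecked(16263594L);  // example height from the logs above
        Thread.sleep(1000L);
        frozenEdgeChecked(16263594L);  // same height again; once past the threshold this would warn
    }
}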

I need some time to gather and lay out higher-level thoughts in a concise way.

@n-y-z-o
Owner

n-y-z-o commented Apr 22, 2022

For one very small piece of this, I can offer a fix. The following change was just deployed to the nyzo.co web server:

public static boolean identifierIsUnhealthy(ByteBuffer identifier) {
    return verifierUnhealthyIdentifiers.contains(identifier);
}

There should be no more instances of a stuck verifier not displaying as yellow on the nyzo.co homepage.

@n-y-z-o
Owner

n-y-z-o commented Apr 22, 2022

For those who haven't followed this whole thread, the previous version of the method was:

public static boolean identifierIsUnhealthy(ByteBuffer identifier) {
    return verifierUnhealthyIdentifiers.contains(identifier) &&
        !verifierStatusHealthyIpToIdentifierMap.values().contains(identifier);
}

This would sometimes leave a phantom entry that would cause a verifier to show as white on the nyzo.co homepage due to an IP switch. The new version may result in some false positives (yellow when there is a healthy verifier), but it should completely eliminate the false negatives (white when the verifier is unhealthy).
