This repository has been archived by the owner on Feb 15, 2024. It is now read-only.

Minimal fixes to address NCFP-10: Do not allow "red" in-queue verifiers anymore. #25

Open
wants to merge 21 commits into base: master

Conversation

@EggPool commented Jul 29, 2020

Reference:
#22

By voting yes on NCFP-10, the current cycle acknowledged the issue of always-red verifiers being able to remain in the queue with no actual verifier running, simply by spamming node-joins once they are in the "nodes" file.

This patch does the following:

  • persists the new communicationFailureCount metric in the nodes file
  • rebalances the increment/reset of the failure count by resetting communicationFailureCount on a successful answer only, never on incoming messages alone (see the sketch after this list)
  • logs node-join messages more verbosely, so external scripts can use that info (IP and short id) to auto-ban or rate-limit spamming nodes
  • adds some context to changelog.md
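
To make the intent of the first two points concrete, here is a minimal sketch of the bookkeeping, assuming illustrative threshold constants; the method names markSuccessfulConnection() and markFailedCommunicationAttempt() come from the code discussed below, but this is not the exact patch code.

// Minimal sketch (not the actual patch): track communicationFailureCount per node,
// reset it only on a real answer, and drop the node only after several consecutive
// failures plus a grace period. The constants are illustrative assumptions.
public class Node {

    private static final int consecutiveFailuresBeforeRemoval = 6;                      // from the discussion below
    private static final long removalGracePeriodMilliseconds = 1000L * 60L * 60L * 4L;  // assumed value

    private long communicationFailureCount = 0L;
    private long inactiveTimestamp = -1L;

    // Called only when the peer actually answered a request, never on incoming messages alone.
    public void markSuccessfulConnection() {
        communicationFailureCount = 0L;
        inactiveTimestamp = -1L;
    }

    // Called when a socket cannot be opened for an outgoing message.
    public void markFailedCommunicationAttempt() {
        communicationFailureCount++;
        if (communicationFailureCount >= consecutiveFailuresBeforeRemoval && inactiveTimestamp < 0) {
            inactiveTimestamp = System.currentTimeMillis();
        }
    }

    public long getCommunicationFailureCount() {
        return communicationFailureCount;
    }

    // Eligible for removal only after enough consecutive failures and after the grace period.
    public boolean shouldBeRemovedFromMesh() {
        return inactiveTimestamp > 0 &&
                System.currentTimeMillis() - inactiveTimestamp > removalGracePeriodMilliseconds;
    }
}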

This version has been tested for several days on several verifiers (both in-cycle and in-queue).
v596 and v597 have been integrated since, so this PR can be merged with no conflict.

A preview of what the eligible (> 30 days) Nyzo queue would look like with this patch is available at https://nyzo.today/queue, generated from a modded verifier.
https://nyzo.today/dropped is a searchable table of "dropped" verifiers (nodes that are in the original nodes file but dropped from the modded version because they were consistently inactive for too long).

The difference from the current Nyzo queue is striking, and no false positive has been reported so far.
This is just a first step toward addressing the huge number of verifiers abusing the current code, and in no way a definitive answer.
It only patches today's most urgent issue in the most minimal and side-effect-free way.

@itsMarcoSolis commented Jul 29, 2020

Comparing the queue before and after applying this fix, it is evident that a major improvement in the joining process has been achieved. Not a single member of the community has reported a false positive, so it's safe to assume this change won't impact the cycle in a negative manner.
I hope the core devs can merge this PR soon.

@bh00ker commented Jul 29, 2020

I fully support this proposed PR. I've been following it closely. I think it addresses NCFP-10 quite well.

@m-burner

I'm against this PR. We need a vote from the whole community.
If someone decides to make changes that cut 3/4 of the queue, a general decision is needed.

@bh00ker commented Jul 29, 2020

> I'm against this PR. We need a vote from the whole community.
> If someone decides to make changes that cut 3/4 of the queue, a general decision is needed.

We voted that red verifiers shouldn't be allowed to participate in the queue/lottery. This addresses it, and well, I might add.

@orisica commented Jul 29, 2020

Great work. It's about time we remove fake verifiers from the queue and resume proof of diversity, which is in accordance with the cycle vote.

@Syd-ai commented Jul 29, 2020

Great work, it will definitely bring better diversity to the Nyzo cycle. Looking forward to a merge 🙏

@setanimals

Supported! TY Sy for all the great work on this.

@l-u-f-o commented Jul 29, 2020

Reviewed the code and test results. This PR has my full support. The cycle tx result of NCFP-10 also shows support by the community.

@nsejourne

I want this PR to be merged!

@krypdkat commented Aug 3, 2020

LGTM 👍 Tested for 4 days, no issues.

I think you should resolve the conflict, squash all commits into one, and move the changelog somewhere else to make it easier for the dev to build the next version on top of your PR 🙂

@@ -96,7 +101,9 @@ private static void updateNode(byte[] identifier, byte[] ipAddress, int portTcp,
if (portUdp > 0) {
existingNode.setPortUdp(portUdp);
}
existingNode.markSuccessfulConnection();
if (isNodeJoinResponse) {
Owner

This would need to be reviewed in detail. It could likely have a negative effect on in-cycle connectivity.

Author

Thanks for reviewing and commenting.
NCFP-10 shows the cycle considers this an important matter, one with a long-lasting and destructive effect on the Nyzo cycle and concept itself.

Owner

Please don't make such grand claims. We know people are worked up about this, and we sincerely appreciate your involvement and effort, but a bunch of red nodes does not have a "long lasting and destructive effect on the Nyzo cycle and concept itself." Remember that the original entrance process was a simple FIFO queue. It was, in fact, sufficient, though not ideal. The lottery is light-years ahead of the FIFO queue. As long as you have any source of scarcity as a reference for entrance, and as long as no single entity can cheaply control significantly more than half of that source of scarcity for long enough (years, with the current cycle size), then the proof of diversity is safe.

IP addresses are a source of scarcity. A lot of people feel that you should have a verifier continually running behind those IP addresses. We don't. We feel that controlling an IP address is enough, and the current system already enforced that. We also want to eliminate wastefulness as much as possible, and forcing someone to run a verifier is wasteful. Additionally, running a single verifier that receives entire blocks of node-join requests would be incredibly easy. So, you're more likely to cut out the small-time operators' red nodes while the larger blocks of red nodes are more likely to set up a single verifier to back them (and you'd only need to forward incoming 9444 requests to that verifier, leaving the IP addresses otherwise usable). This would further reward the big blocks while pushing out more of the little guys.

Overall, eliminating red nodes from the lottery, while inadvertently catching a number of white nodes in the process, drastically reduced the diversity of the lottery and harmed the diversity of the system as a result. There are currently 3,374 white nodes on https://nyzo.today/dropped.

Your idea of IP-based lottery scoring is, however, a wonderful solution. When you really dig into the idea and start looking at all of the facets of it, it's absolutely brilliant and very much a Nyzo-esque solution. It's simple, it encourages diversity, and it encourages good behavior. These are all natural properties that just fall out of the solution, not things that have to be implemented with complex machinations. We will be releasing an NTTP soon to explore this further.

Author

Please accept my apologies for the "grand claim".
With no feedback from you, and no idea of your stance on this issue (or whether it was an issue at all), we tried our best to cope with what many saw as an effective attack on the queue (more than 50% of joins) and therefore on the cycle.
With the current cycle size, indeed, it would take years. But who knows how many verifiers these groups already control?
The first C-class IP network seen in the queue goes back quite a long time (Jim tweeted about it at the time, IIRC; I can't find the date anymore).

As for the destructive effect, it is also a psychological one on small operators and individuals.
How can you justify paying for a VPS to add one IP to the queue when you already see tens of red C-classes? This discourages all the small fish, and I know it has encouraged knowledgeable users to level up their game as well.

This PR was more of a first, temporary answer, meant to be applied quickly and to buy time for a more thoughtful one (the lottery).
I'm aware that even with this PR, operators of a full C-class could stay in the queue; it would just require more work and resources from them.

We knew your stance on wastefulness, but your position that a full verifier does not necessarily need to be running is also an important piece of information we did not have until now.
Speaking for myself, while I'm not fond of waste at all, I see the current behavior as heavily tilted toward ISPs (see Odessa) or companies that already own lots of IPs for other uses, so their marginal cost to run a huge number of in-queue nodes (significantly more than half of the queue currently) is nil.
This PR does not address that; only the C-class-aware lottery can.

As for white verifiers caught in the process, the picture nyzo.today/dropped shows is the result of two things:

  • drops due to failedCount
  • bans due to node-join spam

Excessive node-join spam is auto-banned on this system, which was heavily spammed, so this could play a role.

tl;dr: I agree this PR is not a full-featured solution and has its flaws.
It was intended as a minimal, low-impact (so I believed), temporary patch, quick to apply, to regain the confidence of the small and average actors that cannot operate batches of 256 IPs, limiting the risk to the cycle and buying time to build a more complex solution (the lottery).
Thanks a lot for your answers; can't wait for the upcoming NTTP!

@@ -462,11 +470,19 @@ private static void loadPersistedNodes() {
int portUdp = Integer.parseInt(split[3]);
long queueTimestamp = Long.parseLong(split[4]);
// long identifierChangeTimestamp = Long.parseLong(split[5]); no longer used
long communicationFailureCount = Long.parseLong(split[5]);
if (communicationFailureCount > 1000) {
Owner

This is a smart check.
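
For context, this check presumably guards against reading a nodes file written in the previous format, where column 5 still held the identifier-change timestamp (a millisecond value). A minimal, illustrative sketch of that reading, not the exact patch code:

// Sketch only: a value above 1000 is almost certainly a legacy timestamp rather than
// a failure count, so one plausible handling is to restart the new metric from zero.
long communicationFailureCount = Long.parseLong(split[5]);
if (communicationFailureCount > 1000) {
    communicationFailureCount = 0L;
}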

@@ -234,6 +233,7 @@ public void run() {
socket.setSoTimeout(1000);
response = readFromStream(socket.getInputStream(), socket.getInetAddress().getAddress(),
message.getType());
NodeManager.markSuccessfulConnection(hostNameOrIp);
Owner

This could also have a negative impact on in-cycle connectivity. During periods of stress, a verifier might be able to make a TCP connection but not be able to process and respond within the 1000ms timeout period. The behavior of this with large messages should be carefully considered, also.

Author

Thanks a lot for the feedback.
The current behavior is overly optimistic about users running alternate code, and overly pessimistic about network reliability.
As a result, once you're in, it's way too easy to remain in the "nodes" file and stay lottery eligible.

Since markSuccessfulConnection() resets the failed count, and dropping a node requires 6 consecutive failures plus a timestamp threshold,
not every request needs to complete within the timeout or be processed successfully.
At worst, you need just one real answer during 6 consecutive queries plus the timestamp threshold, and all failures are reset.
No matter the stress period or some long messages: the count will be reset anyway for a real verifier.
Moreover, the only way to increment the "failed" count is when a socket cannot be opened for a message.
If the peer times out, the failure count is not even incremented.
Lastly, IIRC an in-cycle verifier that dropped out of the nodes file that way would be re-added instantly on the next node-join, so it looks no more harmful than an in-cycle verifier being down for some time.

If this poses a real risk to in-cycle connectivity (this has been running for almost 2 weeks on some verifiers, about 30% of the cycle),
then the early markSuccessfulConnection() should only be called for in-cycle verifiers, not in-queue ones.
I'm unsure of the computational cost of such a lookup.
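
A rough sketch of that alternative, assuming a cycle-membership lookup along the lines of BlockManager.verifierInCurrentCycle() and a hypothetical NodeManager.nodeForIp() helper (neither name is guaranteed to match the actual codebase):

// Sketch only: perform the early reset on a successful read, but only for in-cycle peers.
response = readFromStream(socket.getInputStream(), socket.getInetAddress().getAddress(),
        message.getType());
Node respondingNode = NodeManager.nodeForIp(hostNameOrIp);   // hypothetical lookup helper
if (respondingNode != null
        && BlockManager.verifierInCurrentCycle(ByteBuffer.wrap(respondingNode.getIdentifier()))) {
    NodeManager.markSuccessfulConnection(hostNameOrIp);      // early reset for in-cycle only
}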

Author

As additional feedback and a second thought:
failedCount should only matter for in-queue verifiers.
In-cycle verifiers can be down for some time, they are backed up by sentinels, and their performance score is what gets them voted out if they are offline for a long time.
This is the main mechanism for dropping in-cycle verifiers, along with no block being provided, be it by the verifier or its sentinel.

So the failed count would only be relevant in-queue, as a way to make sure that an IP indeed hosts a verifier with an open and working port.
The logic would then be to simply ignore markFailedCommunicationAttempt() for in-cycle verifiers.
In-cycle connectivity would then be preserved. Once in-cycle, a node would be less sensitive to stress, DoS, or temporary issues.
All that matters is producing a block (by verifier or sentinel) and performance metrics (long-term operation).

Would that cause some other side effects?
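
A minimal sketch of that variant, again assuming a cycle-membership lookup such as BlockManager.verifierInCurrentCycle(); the method shown here is illustrative, not the actual NodeManager code:

// Sketch only: skip failure bookkeeping entirely for in-cycle verifiers, so only
// in-queue nodes can be dropped through communicationFailureCount.
private static void markFailedConnectionAttempt(Node node) {
    boolean isInCycle = BlockManager.verifierInCurrentCycle(ByteBuffer.wrap(node.getIdentifier()));
    if (!isInCycle) {
        node.markFailedCommunicationAttempt();
    }
    // In-cycle verifiers rely on performance scores and sentinel backing instead.
}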

Owner

The failedCount is important for in-cycle verifiers because it eliminates verifiers that are no longer present. If we did not have this, we would still be sending messages to every IP ever associated with an in-cycle verifier.

This needs to be examined in the specific case of a cycle stall or near-stall. The cycle has seen no stress under this code, so it has not yet been tested. There is a delicate balance between pruning permanently dead connections quickly to eliminate excess traffic and allowing connections with temporary problems to recover efficiently.

When the blockchain is moving forward at a normal pace, everything is easy. When the cycle becomes distressed, that's when the danger happens, and that's what we always need to engineer for. This is exceptionally difficult, because simulating the current Nyzo mesh is impractical.

Author

Thanks a lot for the feedback.

We were in the dark for some time with no hints, and we missed you a lot.
I now understand your concerns about these changes better.
I would address them by differentiating in-cycle from out-of-cycle nodes with regard to markSuccessfulConnection, since the danger comes from out of cycle, and in-cycle needs more failsafes.

I will answer in more depth in the second comment.
