
allow online peers to be removed #654

Open
ripienaar opened this issue Dec 19, 2022 · 11 comments

Comments

@ripienaar
Collaborator

We can do nats server raft peer-remove, but this requires the peer to be offline.

The server doesn't restrict this, so we could ask it to remove an online peer. However, when we tested this we found that instability resulted if the removed peer restarted (and possibly other issues as well).

As it's been a while since we tested this properly, we should test again and reconsider supporting removal of online peers with -f.
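For reference, a sketch of the current command and the proposed forced variant (the -f flag is the suggestion under discussion here, not a shipped option):

```shell
# Current behaviour: remove a peer from the JetStream meta group,
# with the expectation that the peer is already offline.
nats server raft peer-remove n1-lon

# Proposed: explicitly force removal of a peer that is still online.
# The -f flag is hypothetical, pending this issue.
nats server raft peer-remove n1-lon -f
```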

//cc @derekcollison

@derekcollison
Member

I think those tests were early in the 2.9 release cycle. At a minimum we should allow it with -f.

We should also retest that instability you saw.

@ripienaar
Collaborator Author

ripienaar commented Dec 19, 2022

A previous test cycle got these results:

So I peer-removed an online server; it went offline and was able to come back after a restart (it couldn't in the past). However, after it came back it was in a very bad way.

Lots of

Resetting stream cluster state for 'one > ORDERS_42'
RAFT [ACbIFjhc - C-R3F-WgfPNMuK] Request from follower for index [212] possibly beyond our last index [217] - no message found
RAFT [4FFRXkOw - C-R3F-WgfPNMuK] Expected first catchup entry to be a snapshot and peerstate, will retry
Catchup for stream 'one > ORDERS_42' complete
Error applying entries to 'one > ORDERS_15': last sequence mismatch

This appears to resolve, but phantom consumers also showed up that weren't there before, couldn't be INFO'd, etc., and other consumers just never caught up to current, without any logs:

{"name":"n1-lon","current":false,"active":261784,"lag":217}

I'm not sure I'd be willing to use this in any scenario other than the one outlined earlier: the node is dead, offline, and never coming back.

@ripienaar
Collaborator Author

So we should re-test what happens when an online peer is removed and then comes back after a restart (this is the problem: remove an online peer and it will most likely come back soon) and whether it is stable when it returns.
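A rough sketch of the sequence to re-test (server names follow the test environment used later in this thread; the restart step depends on how the server is managed, systemd is shown purely as an example):

```shell
# 1. Remove a peer that is currently online.
nats server raft peer-remove n1-lon

# 2. Restart the removed server so it attempts to rejoin.
systemctl restart nats-server

# 3. Observe whether it stabilises: per-server JetStream state
#    and per-stream replica health.
nats server report jetstream
nats stream report
```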

@derekcollison
Member

derekcollison commented Dec 19, 2022

Agreed. @bruth, is it possible to test this on 2.9.10 (top of main)?

@bruth
Member

bruth commented Dec 19, 2022

Yes, will create a test example.

@ripienaar
Collaborator Author

It does seem to behave better now, but still, what do we gain by removing an online node? As soon as someone restarts it, it will come back.

@ripienaar
Collaborator Author

I am seeing some inconsistencies with messages on disk after it comes back, as if some limits didn't apply or something:

│ Server  │ Cluster │ Domain │ Streams │ Consumers │ Messages │ Bytes   │ Memory │ File    │ API Req │ API Err │
├─────────┼─────────┼────────┼─────────┼───────────┼──────────┼─────────┼────────┼─────────┼─────────┼─────────┤
│ n3-lon  │ lon     │ hub    │ 39      │ 171       │ 51,660   │ 520 MiB │ 0 B    │ 520 MiB │ 1,188   │ 0       │
│ n1-lon  │ lon     │ hub    │ 39      │ 172       │ 58,190   │ 586 MiB │ 0 B    │ 586 MiB │ 1       │ 0       │
│ n2-lon  │ lon     │ hub    │ 47      │ 171       │ 51,672   │ 520 MiB │ 0 B    │ 520 MiB │ 1,592   │ 0       │

Here I bounced n1 out a few times, keeping it down, etc. It should have the same message count as the other two, but something has clearly gone bad there once it came back.

Will need some more time to investigate / reproduce.
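One way to narrow down the divergence (a sketch using standard natscli commands; ORDERS_42 is just one of the stream names from the logs above, and the jq step is illustrative):

```shell
# Per-server JetStream totals, as in the table above.
nats server report jetstream

# Per-stream message counts and replica state, to spot which
# streams hold the extra messages on n1-lon.
nats stream report
nats stream info ORDERS_42 --json | jq '.state.messages'
```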

@ripienaar
Collaborator Author

Also, after removing n1-nyc while online, report jsz still returns data about it:

├─────────┬─────────┬────────┬─────────┬───────────┬──────────┬─────────┬────────┬─────────┬─────────┬─────────┤
│ Server  │ Cluster │ Domain │ Streams │ Consumers │ Messages │ Bytes   │ Memory │ File    │ API Req │ API Err │
├─────────┼─────────┼────────┼─────────┼───────────┼──────────┼─────────┼────────┼─────────┼─────────┼─────────┤
│ n1-lon  │ lon     │ hub    │ 38      │ 172       │ 51,790   │ 521 MiB │ 0 B    │ 521 MiB │ 1       │ 0       │
│ n2-lon  │ lon     │ hub    │ 41      │ 171       │ 51,795   │ 521 MiB │ 0 B    │ 521 MiB │ 1,789   │ 0       │
│ n3-lon  │ lon     │ hub    │ 38      │ 171       │ 51,790   │ 521 MiB │ 0 B    │ 521 MiB │ 1,352   │ 0       │
│ n3-nyc* │ nyc     │ hub    │ 35      │ 165       │ 51,180   │ 515 MiB │ 0 B    │ 515 MiB │ 846     │ 0       │
│ n2-nyc  │ nyc     │ hub    │ 35      │ 165       │ 51,180   │ 515 MiB │ 0 B    │ 515 MiB │ 990     │ 0       │
│ n1-nyc  │ nyc     │        │ 0       │ 0         │ 0        │ 0 B     │ 0 B    │ 0 B     │ 0       │ 0       │
│ n3-sfo  │ sfo     │ hub    │ 35      │ 165       │ 50,945   │ 512 MiB │ 0 B    │ 512 MiB │ 725     │ 1       │
│ n1-sfo  │ sfo     │ hub    │ 34      │ 165       │ 45,705   │ 459 MiB │ 0 B    │ 459 MiB │ 0       │ 0       │
│ n2-sfo  │ sfo     │ hub    │ 34      │ 165       │ 45,705   │ 459 MiB │ 0 B    │ 459 MiB │ 739     │ 0       │
├─────────┼─────────┼────────┼─────────┼───────────┼──────────┼─────────┼────────┼─────────┼─────────┼─────────┤
│         │         │        │ 290     │ 1,339     │ 400,090  │ 3.9 GiB │ 0 B    │ 3.9 GiB │ 6,442   │ 1       │
╰─────────┴─────────┴────────┴─────────┴───────────┴──────────┴─────────┴────────┴─────────┴─────────┴─────────╯

See all the 0s: that's the removed machine.

@ripienaar
Collaborator Author

And once it reboots and comes back online, it gets no data and all the streams show only the 2 remaining peers:

│ Stream     │ Storage │ Placement     │ Consumers │ Messages │ Bytes   │ Lost │ Deleted │ Replicas        │
├────────────┼─────────┼───────────────┼───────────┼──────────┼─────────┼──────┼─────────┼─────────────────┤
│ ORDERS_4   │ File    │ cluster: nyc  │ 5         │ 1,000    │ 10 MiB  │ 0    │ 0       │ n2-nyc, n3-nyc* │
│ ORDERS_1   │ File    │ cluster: nyc  │ 5         │ 1,000    │ 10 MiB  │ 0    │ 0       │ n2-nyc*, n3-nyc │

However, the meta group shows it up and healthy:

╭──────────────────────────────────────────────────────────────╮
│                 RAFT Meta Group Information                  │
├────────┬──────────┬────────┬─────────┬────────┬────────┬─────┤
│ Name   │ ID       │ Leader │ Current │ Online │ Active │ Lag │
├────────┼──────────┼────────┼─────────┼────────┼────────┼─────┤
│ n1-lon │ 4FFRXkOw │        │ true    │ true   │ 0.74s  │ 0   │
│ n1-nyc │ chC4jxAp │        │ true    │ true   │ 0.82s  │ 0   │
│ n1-sfo │ Vky05QUP │        │ true    │ true   │ 0.74s  │ 0   │
│ n2-lon │ ACbIFjhc │        │ true    │ true   │ 0.74s  │ 0   │
│ n2-nyc │ ypSxRZrl │        │ true    │ true   │ 0.82s  │ 0   │
│ n2-sfo │ Qp0rD5EC │        │ true    │ true   │ 0.74s  │ 0   │
│ n3-lon │ J0t0ySUw │        │ true    │ true   │ 0.74s  │ 0   │
│ n3-nyc │ P2IE71bg │ yes    │ true    │ true   │ 0.00s  │ 0   │
│ n3-sfo │ tYwrA5d9 │        │ true    │ true   │ 0.75s  │ 0   │
╰────────┴──────────┴────────┴─────────┴────────┴────────┴─────╯

Yet when it started up, it logged that it was restoring, etc.

I think we have a bunch of work to do before we can consider allowing this.
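For cross-checking that contradiction, something like the following could be used (a sketch; both views should come from standard natscli reports):

```shell
# Meta group membership, where the removed-and-restarted node
# shows as current and online.
nats server report jetstream

# Stream replica listing, where the same node is absent from
# the R3 streams placed in nyc.
nats stream report
```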

@ripienaar
Collaborator Author

I don't think these problems are limited to removing online nodes and bringing them back; I think these are general issues with nodes coming back.
