Nats server pod gets stuck in stream catchup after restart #5205
Comments
Observing the same issue on latest NATS version as well. Tested versions:
We have ~20 streams, each receiving multiple writes.
The workaround we found so far to get a node back to a healthy state without data loss is to temporarily change the replication factor of the affected stream from 3 to 1, then restore the original replication factor.
This way the affected node no longer needs to host the stream, so it can finish initialisation.
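Assuming the natscli tool is available, the replica-bounce workaround described above might look like the following sketch (the stream name is a placeholder, and the exact flags should be verified against your natscli version):

```
# Hypothetical name; replace with the affected stream.
STREAM=ORDERS

# Temporarily drop the stream to a single replica so the stuck
# node is no longer asked to host it.
nats stream edit "$STREAM" --replicas 1 --force

# Wait for the restarted node to finish initialisation, then
# restore the original replication factor.
nats stream edit "$STREAM" --replicas 3 --force
```

Note that while the stream is at 1 replica it has no redundancy, so this is a last-resort recovery step rather than routine maintenance.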
Same issue on nats-2.10.12. I went from 1 replica to 2 replicas for my stream, and it helped.
We have experienced the same issue in multiple environments for interest-based JetStream (not KV).

Environment information

Symptom

We noticed that the symptom seems to be caused by the

Reproduce

It's relatively easy to reproduce this issue in a 3-replica Kubernetes-hosted NATS cluster with JetStream enabled.

Actual behaviour

The

Expected behaviour

The

Mitigation

As @bondar-pavel suggested, editing the stream replicas to 1 and then back to 3 seems to be the only mitigation without losing data or causing downtime.

Notes

Stream configuration

{
"config": {
"name": "<redacted>",
"subjects": [
"<redacted>"
],
"retention": "interest",
"max_consumers": -1,
"max_msgs_per_subject": -1,
"max_msgs": -1,
"max_bytes": -1,
"max_age": 0,
"max_msg_size": -1,
"storage": "file",
"discard": "old",
"num_replicas": 3,
"duplicate_window": 10000000000,
"sealed": false,
"deny_delete": false,
"deny_purge": false,
"allow_rollup_hdrs": false,
"allow_direct": true,
"mirror_direct": false
},
"created": "2024-03-28T17:59:55.75752643Z",
"state": {
"messages": 1372,
"bytes": 127596,
"first_seq": 11761831,
"first_ts": "2024-03-28T19:27:57.692021989Z",
"last_seq": 11763599,
"last_ts": "2024-03-28T19:27:58.395525207Z",
"num_deleted": 397,
"num_subjects": 1,
"consumer_count": 1
},
"cluster": {
"name": "kubernetes-nats",
"leader": "kubernetes-nats-2",
"replicas": [
{
"name": "kubernetes-nats-0",
"current": true,
"active": 196170
},
{
"name": "kubernetes-nats-1",
"current": true,
"active": 220688
}
]
},
"ts": "2024-03-28T19:27:58.395990925Z"
}

Consumer configuration

{
"stream_name": "<redacted>",
"name": "<redacted>",
"config": {
"ack_policy": "explicit",
"ack_wait": 30000000000,
"deliver_policy": "new",
"durable_name": "<redacted>",
"name": "<redacted>",
"max_ack_pending": 20000,
"max_deliver": 5,
"max_waiting": 512,
"replay_policy": "instant",
"num_replicas": 0
},
"created": "2024-03-28T17:59:56.100249122Z",
"delivered": {
"consumer_seq": 38939132,
"stream_seq": 12980344,
"last_active": "2024-03-28T19:35:54.238507718Z"
},
"ack_floor": {
"consumer_seq": 38933971,
"stream_seq": 12978586,
"last_active": "2024-03-28T19:35:54.197600505Z"
},
"num_ack_pending": 1545,
"num_redelivered": 1121,
"num_waiting": 4,
"num_pending": 0,
"cluster": {
"name": "kubernetes-nats",
"leader": "kubernetes-nats-2",
"replicas": [
{
"name": "kubernetes-nats-0",
"current": true,
"active": 13341
},
{
"name": "kubernetes-nats-1",
"current": true,
"active": 146374
}
]
},
"ts": "2024-03-28T19:35:54.238705331Z"
}
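As a side note, the stream state above is internally consistent: the sequence span minus the interior deletes equals the stored message count. A quick sanity check, using only the numbers from the stream state (pure arithmetic, no server needed):

```shell
# Values copied from the stream state above.
first_seq=11761831
last_seq=11763599
num_deleted=397

# Total sequence span covered by the stream.
span=$(( last_seq - first_seq + 1 ))

# Span minus deleted sequences should equal "messages".
echo $(( span - num_deleted ))   # prints 1372, matching "messages" above
```

When replicas desync (as described later in this thread), it is exactly these first_seq/num_deleted figures that diverge between nodes.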
We have some improvements coming in 2.10.14 around this which hopefully can help. If you are feeling adventurous, feel free to grab a binary from the
I've been playing with @derekcollison's dev build from
As shown in the graph, the
Below are some error messages from the first dev build pod during rollout.
We will start cleaning up and cherry-picking into main and then into 2.10.14. Thanks for checking though!
Note that 2.10.14 has now been released (and 2.10.15 should follow very soon).
(Copying the reply from NATS Slack.) I just repeated my chaos tests with 2.10.14, basically having a fast-moving stream with a publisher (nats pub) and an interest-based consumer, while rolling-restarting my 3-replica StatefulSet of NATS.

Overall it's been a LOT better. Previously, almost 100% of the time the first_seq number of the fast-moving stream would go out of sync; in particular, the nats-2 instance that was restarted first would almost always have its

Note that the first rolling restart did reproduce the

Overall I'm really happy with 2.10.14. If it goes well, it would address a major operational pain point of JetStream. I'll deploy it to prod soon and let you know if anything comes up. Thanks again for the hard work!
After restarting the cluster nodes one by one on 2.10.14, we lost 60% of the KV storage. Reproduced on 1 of 10 clusters after the restart.
What does stream info for the underlying KV show?
01 node

02 node

You can look at the last 10k lines of the log.

01 STOP DELETE: look at "num_deleted": 59009,

02

03

config
We fixed an issue with discard new that is in main and will be in 2.10.15; however, your KV may already have had inconsistencies, so when a new leader was elected it used its own state. The way to sync a known good state is to make sure the good replica is the leader, then scale the stream down to 1 replica and back up. Once upgraded to 2.10.15 the issue should not occur anymore.
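Assuming the natscli tool, the resync procedure described above might be sketched as follows. The stream name is a placeholder (a KV bucket's backing stream is named KV_ plus the bucket name), and since step-down simply triggers a new election, it may need repeating until the known-good replica wins leadership:

```
STREAM=KV_mybucket   # hypothetical; backing stream of the affected bucket

# Check which replica is currently the leader.
nats stream info "$STREAM"

# Trigger a leader election; repeat until the known-good replica leads.
nats stream cluster step-down "$STREAM"

# Scale down to 1 replica (keeping only the leader's state), then back up
# so the other replicas resync from the known good state.
nats stream edit "$STREAM" --replicas 1 --force
nats stream edit "$STREAM" --replicas 3 --force
```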
Observed behavior
I discovered this when exploring key value and testing resilience and performance, among other things.
I used a Helm install in Kubernetes.
While updating a key value bucket (in-memory, with replication 3) with about 1000 updates/s, I restarted one pod. It never came back up; it got stuck in a stream catchup state (I waited over 20 minutes). Subsequent restarts did not resolve the issue. It was only resolved after the stream was deleted (or, in subsequent tests, when all NATS pods were stopped).
See also this thread in slack: https://natsio.slack.com/archives/C06EN6HCWE4/p1710151450266159
Stream:
Log snippet:
Expected behavior
I expected the restarted pod to catch up and come back into the cluster.
Server and client version
Server version: 2.10.11
Client version (golang client): v1.32.0
Host environment
Both server and client were running in managed Kubernetes in Google Cloud, using Google Filestore as the persistence layer.
Steps to reproduce
It is a bit hard to reproduce; I only managed to reproduce it in about 1 of 5 tries.
What I did was:
I also tried to reproduce it using a file store for the stream, but didn't manage to within 10 tries or so.
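For reference, a minimal version of the chaos test described above could look like the following sketch. The bucket name and pod name are assumptions, it requires the nats CLI and kubectl against a 3-replica cluster, and a simple put loop like this will be slower than the ~1000 updates/s used in the real test:

```
# Create an in-memory KV bucket with 3 replicas (assumed name "chaos").
nats kv add chaos --storage memory --replicas 3

# Drive continuous updates against a single key in the background.
while true; do
  nats kv put chaos mykey "$(date +%s%N)" >/dev/null
done &

# Restart one server pod mid-traffic, then watch whether it rejoins
# the cluster or gets stuck in stream catchup.
kubectl delete pod kubernetes-nats-1
```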