Replies: 1 comment 1 reply

Thanks @Kawon1 for the detailed write up, the |
Hello, could you please provide more information about how a drifted stream replica is handled by the NATS JetStream cluster while it is also managing new requests? For context, I will be talking about NATS JetStream deployed in a Kubernetes environment (AKS), where I performed tests on the self-healing/sync of a Stream cluster by purposefully purging all of the filestores on one of the NATS JetStream nodes - for the purpose of this experiment it is called `nats-0`.

When it comes to the unexpected OOMKill caused by a replica that is lagging behind the others, NATS JetStream does not appear to have any rate limiter and tries to catch up with the leader as fast as it can. I've observed that NATS JetStream is highly optimized for managing extremely high traffic volumes! Nevertheless, the process of restoring and catching up with streams doesn't appear to be equally optimized. Here is the graph representing the difference in messages between the replicas of a specific Stream, let's call it `StreamOne`. The query used for this Prometheus graph: `sum(nats_stream_total_messages{stream_name=~"$stream"}) by (pod)`
I intentionally erased the storage on `nats-0` (the green one) at 13:37 to simulate a breakdown scenario and test the self-healing capabilities of NATS JetStream. Such disasters might happen in a production environment. At this point the replica was behind the leader and was attempting to catch up with it. For clarification, the problem did not occur when the traffic was stopped and the only task within the Stream cluster was for `nats-0` to catch up with the leader; it seems to be completely different when both things happen at once - catching up with streams and receiving new requests. The gaps between the green dots/lines are the result of OOMKills, which were caused by catching up with the leader while the other members of the RAFT cluster were receiving requests. The fact that the same thing happened again at 16:58 and at 18:00 shows the deterministic nature of the situation given our configuration. Here are the logs which depict the catching-up situation for the `StreamOne` stream:
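Since there seems to be no rate limiter on catch-up, one mitigation I am considering is capping the Go heap below the container memory limit, so the runtime collects garbage aggressively during catch-up instead of tripping the cgroup OOM killer. A sketch only - `GOMEMLIMIT` is a standard Go runtime variable, but the statefulset/container names and the 6GiB value are assumptions based on a default install of the official Helm chart:

```bash
# Quick experiment: cap the Go heap of the nats container during catch-up.
# For a real deployment this belongs in the Helm values, not in kubectl.
kubectl set env statefulset/nats --containers=nats GOMEMLIMIT=6GiB
```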
What's more interesting is that a second stream, let's call it `StreamTwo`, drifted away from the leader completely, as depicted on the graph below (it's the same situation/time described above, just with a different stream; the purging of the whole storage was done at the same time, which can be seen by the gaps between the green dots/lines). Here are the logs regarding the `StreamTwo` stream:

Now, at timestamp 15:40, we can observe that NATS JetStream claims to have caught up: it reports that all streams are current, there are no more OOMs (no gaps between the green dots/lines), and the pod was ready because it passed the startupProbe provided by default in the official Helm chart of NATS. But is it really caught up? Absolutely not! On this graph, we can observe that the number of messages on `nats-0` runs parallel to the number of messages on `nats-1` and `nats-2`, indicating that the synchronization process is considered complete even though the replica of the stream on `nats-0` is completely different! Yet according to the logs, the `streamtwo` stream is current (as seen in the later case, which is also shown on these graphs). Here is the graph which depicts all of the messages without specifying any stream (in this case, only two streams were storing horrendously many messages):
The replicas are simply drifted apart, and self-healing doesn't seem to be working very well. In this case the raw bytes of messages are at the level of 231 GiB (`nats-1`, `nats-2`), whereas `nats-0` has less than half of those 231 GiB, as shown in these graphs. Even though OOMKills were happening, the Stream replicas converged in the end... but as you can see, that is not necessarily true for every stream.
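To double-check this outside of Prometheus, the raw filestore usage can also be compared directly on the pods - again a sketch, assuming the default `/data` store directory and the `nats` container name from the official Helm chart:

```bash
# Compare on-disk JetStream storage across replicas; a drifted replica shows up
# as a much smaller filestore even while it reports itself as current.
for pod in nats-0 nats-1 nats-2; do
  kubectl exec "$pod" -c nats -- du -sh /data/jetstream
done
```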
Regarding this issue, my questions are:

1. If the `streamtwo` leader was moved to `nats-0`, might we experience data loss as a result of `nats-1` and `nats-2` attempting to catch up to that leader? Because, as I showed you, in spite of the huge lag/drift of messages the `streamtwo` stream appears to be current. So in this case, would we get these kinds of messages, or something similar? OR
2. Could it be connected to the `XXX.blk.tmp` blocks in the filestore of the stream (a quick way to count them per pod is sketched after this list)? I've observed that some of them are being fixed/transformed into `XXX.blk`, but some of them are not. Maybe that is the reason for such disruptions: the Stream may be losing track of its indexes, state, etc., so we may end up with an unsynchronized state of the Stream on a few replicas across the Stream cluster.
3. One of the streams (`StreamTwo`) has a completely different number of messages on one of the JetStreams, but according to JetStream it is current... So assuming that this stream becomes the leader in the future, it seems that data loss may occur.
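The check mentioned in point 2 - the paths are my assumption of the filestore layout (`<store_dir>/jetstream/$G/streams/<stream>/msgs`, with `$G` being the default account directory), so adjust them to your setup:

```bash
# Count leftover .blk.tmp files in StreamTwo's message directory on each pod.
# The glob skips over the account directory (named '$G' for the default account).
for pod in nats-0 nats-1 nats-2; do
  kubectl exec "$pod" -c nats -- \
    sh -c 'ls /data/jetstream/*/streams/StreamTwo/msgs/*.blk.tmp 2>/dev/null | wc -l'
done
```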