JetStream KV gets consumer stuck in a cluster node #691
Comments
Multiple consumers are not the problem. If the ordered consumer (the watcher) fails, the client recreates it underneath: it attempts to remove the old consumer, but that may not succeed, especially if the cluster is flapping. It then recreates the consumer, and the server prunes the old consumer after 5 seconds. |
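The recreate-and-prune sequence described in the comment above can be sketched with a toy in-memory model. This is not the nats.deno implementation: `FakeServer`, `resetOrderedConsumer`, the `dropDeletes` flag, and the `watcher_N` / `_INBOX.N` names are hypothetical, and the 5000 ms threshold only mirrors the "5 seconds" mentioned above.

```typescript
// Toy model of an ordered-consumer reset. All names here are hypothetical.
interface ConsumerRecord {
  name: string;
  deliverSubject: string;
  lastActive: number; // ms timestamp of the last client interest
}

class FakeServer {
  consumers = new Map<string, ConsumerRecord>();
  inactiveThresholdMs: number;
  dropDeletes = false; // simulates delete requests lost while the cluster flaps

  constructor(inactiveThresholdMs: number) {
    this.inactiveThresholdMs = inactiveThresholdMs;
  }

  create(name: string, deliverSubject: string, now: number): ConsumerRecord {
    const record = { name, deliverSubject, lastActive: now };
    this.consumers.set(name, record);
    return record;
  }

  // Explicit delete; returns false when the request never reaches the server.
  delete(name: string): boolean {
    if (this.dropDeletes) return false;
    return this.consumers.delete(name);
  }

  // Server-side pruning of consumers that have been idle past the threshold.
  prune(now: number): void {
    this.consumers.forEach((c, name) => {
      if (now - c.lastActive >= this.inactiveThresholdMs) {
        this.consumers.delete(name);
      }
    });
  }
}

// Best-effort removal of the old consumer, then creation of a new one
// bound to a fresh inbox subject.
function resetOrderedConsumer(
  server: FakeServer,
  oldName: string,
  serial: number,
  now: number,
): ConsumerRecord {
  server.delete(oldName); // result intentionally ignored
  return server.create(`watcher_${serial}`, `_INBOX.${serial}`, now);
}
```

In this model a failed delete leaves two consumers on the server for a while, but only the newest one has client interest; the older one disappears once `prune` runs past the threshold.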
Also, if the consumer is never reaped, you have 2 processes that are watching/consuming. As the referenced code shows, we create a new subscription with a new inbox, and the old consumer unsubscribes (nats.deno/jetstream/jsclient.ts, line 784 in cdce825).
Is your watch throwing an error? |
Thanks for the explanation. It has been observed that the consumer is never reaped, even with only one KV watch started by the client. Assuming the app client starts and connects to the
The suspicion is that when the node (nats3) is unavailable, it can't unsubscribe the old consumer, and the client keeps the old subscription active when Checking When the Are updates being propagated through both existing consumers? I noticed the Ack Floor receives an update when publishing a change to the KV_testing: I haven't experienced error exceptions following the steps described. |
A couple of things:
When the client disconnects, the server terminates all interest for that client (the subjects the client is interested in).
In the case of the ordered consumer, if the server comes back quickly enough, it may still have the consumer, which is reaped after the specified time. (It would be nice if you could print the full consumer infos for the clients on KV_testing.)
When the client reconnects, there's a slight chance that the consumer is still live, in which case the subscription is rewired between the consumer and the client, and things resume.
If the client detects a sequence gap, or if the heartbeat monitor detects that the client is not getting messages, the client will recreate the consumer.
If you are seeing 2 different consumers and one of them is not going away, there are 2 different consumers on that KV. |
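The two reset triggers mentioned above (a sequence gap, or the heartbeat monitor noticing silence) can be illustrated with a small check. This is only a sketch under assumptions: `OrderedState`, `checkReset`, and the `maxMissedHeartbeats` tolerance are hypothetical names, not the client's actual internals.

```typescript
// Hypothetical reset check for an ordered consumer; not the library's code.
interface OrderedState {
  expectedConsumerSeq: number; // next consumer sequence the client expects
  lastMessageAt: number; // ms timestamp of the last message or heartbeat
  idleHeartbeatMs: number; // configured heartbeat interval
  maxMissedHeartbeats: number; // assumed tolerance before declaring silence
}

type ResetReason = "sequence_gap" | "heartbeats_missed" | null;

// Returns why the consumer should be recreated, or null if it looks healthy.
// Pass receivedSeq = null when the check is driven by a timer, not a message.
function checkReset(
  state: OrderedState,
  receivedSeq: number | null,
  now: number,
): ResetReason {
  if (receivedSeq !== null && receivedSeq !== state.expectedConsumerSeq) {
    return "sequence_gap";
  }
  const silenceLimit = state.idleHeartbeatMs * state.maxMissedHeartbeats;
  if (now - state.lastMessageAt > silenceLimit) {
    return "heartbeats_missed";
  }
  return null;
}
```

Either non-null result would lead to the recreate path, which is why a client that keeps running should end up with exactly one live watcher subscription.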
I did another local test, which uses some refactored APIs but shares all the changes in your current client:

```typescript
import { connect, millis } from "../src/mod.ts";
import { Kvm } from "../kv/mod.ts";

const nc = await connect();
const kvm = new Kvm(nc);
const kv = await kvm.create("A", { replicas: 3, history: 100 });

const w = await kv.watch();
// Print the watcher consumer's inactive threshold in milliseconds.
console.log(millis((await w._data.consumerInfo()).config.inactive_threshold));

// Consume watch updates in the background.
(async () => {
  for await (const e of w) {
    console.log(`${e.key}: ${e.string()} - ${e.revision}`);
  }
})().catch((err) => {
  console.log(err);
  throw err;
});

// Put a new value every 3 seconds; roll the counter back if the put fails.
let i = 1;
setInterval(() => {
  const idx = i++;
  kv.put("a", `${idx}`).catch((err) => {
    i--;
  });
}, 3000);

// Every second, print the subjects the client is currently subscribed to.
setInterval(() => {
  const subs = nc.protocol.subscriptions.all();
  const subjects = subs.map((sub) => {
    return sub.subject || sub.requestSubject;
  });
  console.log(subjects);
}, 1000);
```

I ran that against my cluster tool (https://github.com/nats-io/nats.deno/blob/d32538fa38bbd636826d53216ad5b86e5a71708e/tests/helpers/cluster.ts) and started chaos on it, where random servers are restarted, creating a very hostile environment.
Note that at any one point the client only has 2 subscriptions, and one of them changes whenever the ordered consumer resets. This means that a cluster route is possibly staying open somehow (from a previous subscription), but it shows for a fact that the process is NOT listening on that subject. I don't know how long it takes before the server figures out that it has that stale situation. Have you observed when it goes away? |
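To make the "one of them changes" observation easier to spot in a long log, successive subject snapshots can be compared. The helper below is a hypothetical sketch (`diffSubjects` and `SubjectDiff` are not part of any NATS client); it only assumes each snapshot is an array of subject strings like the one printed by the test.

```typescript
// Hypothetical helper: compare two snapshots of subscription subjects.
interface SubjectDiff {
  added: string[]; // subjects present now but not in the previous snapshot
  removed: string[]; // subjects present before but gone now
}

function diffSubjects(prev: string[], next: string[]): SubjectDiff {
  const before = new Set(prev);
  const after = new Set(next);
  return {
    added: next.filter((s) => !before.has(s)),
    removed: prev.filter((s) => !after.has(s)),
  };
}
```

Under this sketch, an ordered consumer reset should show up as one removed subject and one added subject, while the total number of subscriptions stays at 2.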
Thanks for sharing the test and results! I added the logging to my local tests and got the same behavior. The client didn't report listening to multiple subjects. |
@ThalesValentim I found the issue - will have a fix and a release in a bit |
@ThalesValentim all the JavaScript clients have been released with the above fix! Thanks for helping me get to the bottom of this! |
Thanks a lot! |
Observed behavior
The KV consumer subscription remains active without interest after the client reconnects to another available cluster node.
Recently, I discussed the possibility of a nats-server issue (see 5352), and it was suggested that it might be a client bug that keeps interest on the old subscription.
Expected behavior
Consumers should be unsubscribed/removed if the client has reconnected to another available cluster node and no longer has interest in them.
Server and client version
Host environment
Steps to reproduce
Considering that nats-server has 3 cluster servers running