Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

After mqtt client is reconnected, it is unable to continue publishing message #1727

Closed
spencerfeng opened this issue Oct 29, 2023 · 56 comments · Fixed by #1855
Closed

After mqtt client is reconnected, it is unable to continue publishing message #1727

spencerfeng opened this issue Oct 29, 2023 · 56 comments · Fixed by #1855

Comments

@spencerfeng
Copy link

Hi there,

I am using the latest version of mqttjs (5.1.3) and this issue happens in old versions as well.

In my app, I received the offline event even when the mosquitto MQTT broker is up and running. After the mqttjs client goes through the process of offline -> closed -> reconnect -> connected. The mqttjs client can no longer publish messages again.

Is there a workaround?

app log:
image

mqtt broker log:
image

@robertsLando
Copy link
Member

Can you provide a script that reproduces the issue?

@jaketakula
Copy link

jaketakula commented Nov 23, 2023

Hey @robertsLando , we are having same issue in our react application in the fronend this time.
the error in the chrome only shows it failed at createWebsocket with array buffer
(sry I have not made screenshot for that).
the only way to let it work is to close and re-open the chrome broswer in our case.

i think now reactjs and nodejs both has the same issue.
Can i grab more attentions on this issue ? Thx.

Screenshot 2023-11-23 at 8 14 42 pm Screenshot 2023-11-23 at 8 14 58 pm

@robertsLando
Copy link
Member

@jaketakula I need to see more details about the error and how to reproduce it. Also what version are you using?

@jaketakula
Copy link

jaketakula commented Jan 9, 2024

@jaketakula I need to see more details about the error and how to reproduce it. Also what version are you using?

  • the mqttjs version being used is 5.1.3

  • this issue happens randomly in chrome broswer - roughly every1 or 2 weeks seen once.
    so i think you can easily setup a ping-pong demo app using 5.1.3 and some mqtt broker you like.
    after that, keepit it running several days and you could see the reconnecion-drop issue.

@robertsLando
Copy link
Member

this issue happens randomly in chrome broswer - roughly every1 or 2 weeks seen once.

you mean 1/2 weeks with the browser opened or what?

@jaketakula
Copy link

jaketakula commented Jan 11, 2024 via email

@robertsLando
Copy link
Member

Could be fixed by #1779 , someone could give a try to 5.3.5?

@jaketakula
Copy link

jaketakula commented Jan 24, 2024 via email

@robertsLando
Copy link
Member

@jaketakula Thanks! Any news?

@jaketakula
Copy link

jaketakula commented Jan 24, 2024 via email

@overflowz
Copy link

any news on this? we have a similar issue so far, we did upgrade to the latest version but want to make sure before we deploy (iot devices) it won't happen again.

@robertsLando
Copy link
Member

robertsLando commented Feb 10, 2024

I didn't checked my own as I never faced this issue, dunno if @jaketakula has news (but I think he would have write if a bug happended). Recent changes fixed a very old bug in reconnect/keep alive that could have caused it BTW

@overflowz
Copy link

overflowz commented Feb 14, 2024

we deployed 10 IoT devices and slowly they're getting disconnected one by one without reconnecting. So the issue still persists.

Here's the config I'm using:

    this.#conn = mqtt.connect(url, {
      cert: CERT_PATH,
      key: KEY_PATH,
      ca: CA_PATH,
      protocolId: "MQTT",
      protocolVersion: 5,
      encoding: "binary",
      clean: false,
      clientId: 'specific-IoT-id',
      keepalive: 60,
      reconnectPeriod: 1000,
      connectTimeout: 30000,
      reschedulePings: false,
      queueQoSZero: true,
      resubscribe: true,
      manualConnect: false,
    });

I also tried to add timeout on publish (since when sending QoS 1 messages, it waits until delivery happens) but fails miserably. pseudo code:

await Promise.race([publish, timer]);
if timeout then client.reconnect();

but after I'm using reconnect() method, every publish request made throws "client disconnecting" error and can't recover from it.

EDIT: can you suggest what shall I do? workaround will also do cause we're in rush right now.

EDIT2: I forgot to run npm install 🤦‍♂️ I'll test again and get back as soon as I have news.

Yup, same thing.

@robertsLando
Copy link
Member

@overflowz Please open a new bug issue and follow the steps in order to also attach DEBUG logs

@overflowz
Copy link

The problem is, it's hard to reproduce and you gotta wait for days or even weeks for it to trigger (this is for the reconnect). As for the reconnect() -- will do later today.

@robertsLando
Copy link
Member

I understand that but it's hard for me to know what's going on here without more info,,, what you could do is to also patch the log function in client and print logs to a file so you don't loose them when this happens

@AndreMaz
Copy link

AndreMaz commented Apr 26, 2024

I'm not sure if it's related but I'm facing a similar problem.

Context
My arch looks like this:

device -- Ethernet connection --> manager -- 5G connection --> MQTT broker
                                     |
                                     L______ 5G connection --> HTTP server

what happens is that every X hours the 5G modem goes down for a short period of time. When this happens I can see in the manager's logs that when it tries to communicate via HTTP it receives a EHOSTUNREACH error, which disappears when the 5G connection is back.

The problem is that when the connection is back the client.connected flips to true but the messages are not being sent.

My sender function looks something like this:

async function sendToBroker(topic, message) {
  if (!this.client.connected) {
    console.warn("Client is not connected, storing message");

    this.storage.store({ topic, message });

    return;
  }

  console.debug("Sending message to broker", message);
  await this.client.publishAsync(topic, message);
}

that is called like this:

async function send(ctx) {
    ctx.call("mqtt.sendToBroker", {
      topic: "topic",
      message: "message",
    }, 
    {
        timeout: 10000, // Throw an error if it takes more than 10 seconds
    });
}

In the logs, I can see Sending message to broker that is followed by a timeout error saying that sendToBroker() did not resolve in 10 seconds.
I'm assuming that it gets stuck in the await this.client.publish(topic, message); line.

Here's how I create my client:

this.client = mqtt.connect(url, {
  username: token,
  cert: fs.readFileSync(certPath, "utf8"),
  rejectUnauthorized: false,
});

Question
Any idea why this is happening? Is there a way to check if the connection is really up?

Might be vaguely related to: #1825

@robertsLando
Copy link
Member

robertsLando commented Apr 26, 2024

@AndreMaz Could you create a full script that I can use to reproduce the issue? By checking the other issues it seems this happens only when working with tls?

@AndreMaz
Copy link

AndreMaz commented Apr 26, 2024

@robertsLando It will be difficult as it's a proprietary code but I'll try to create a repro example

By checking the other issues it seems this happens only when working with tls?

I've been using TLS since the beginning but can't say for sure if it's the source of the problem

@overflowz
Copy link

Can confirm, we're also using with tls, haven't tried otherwise.

@robertsLando
Copy link
Member

I don't need your source code but a scripts that reproduces the issue. An easy one that connects to a broker with TLS (you can use hivemq public one https://www.hivemq.com/mqtt/public-mqtt-broker/) and then try to reproduce the disconnect and see if the problem happens. I tried last time without success and if I cannot reproduce it on my side it's hard to fix

@AndreMaz
Copy link

Yep yep, I know that you don't want it. I've created a simple script that mimics the data-flow but I can't reproduce the issue at home. I've been switching my laptop between WiFi, Ethernet, and 5G hotspot and so far no luck, ie, I did not see the timeout that I've mentioned.

@overflowz
Copy link

if it helps, since we updated to version 5.3.5, we're facing this issue less frequently, but it's still there. I believe these were the the relevant chages that could've affected it #1779

but in the latest release, there are fixes for the possible race condition again: #1848 but it's hard for us to keep updating many iot devices since we have to do a npm install where connection is very unstable.

these issues are really hard for us to test in a production environment and it also costs us a lot. is there any version that is considered stable so we can pick it for now until these issues gets fixed?

thank you!

@AndreMaz
Copy link

I also think that this is a regression that was introduced in the latest releases (don't know which one tho)

I'll have to check my previous releases and test them out. Will keep you guys updated

@robertsLando
Copy link
Member

I'm sorry for the issues guys and I would like to help you but it's hard to guess what could be the root cause here, we should firstly try to find out an easy way to reproduce the issue somehow

@overflowz
Copy link

I'm sorry for the issues guys and I would like to help you but it's hard to guess what could be the root cause here, we should firstly try to find out an easy way to reproduce the issue somehow

No worries at all! We're devs too and we understand the frustration :-) Just to be clear, I didn't mean it as a "fix it asap", I do appreciate the work the maintainers are doing, truly. I was asking if there are any old version(s) I could try so I won't get pressured by the company to fix a problem that is hard to explain why it does not work sometimes xD

Regardless, I really do wish to help somehow too, but it's really hard to reproduce :( We're currently running the code on 50 devices and it might happen once a week, two weeks or even months per one device, it's really unpredictable and random.

@robertsLando
Copy link
Member

I'm the only maintainer here unfortunately and I started because I use this package in almost all my projects and I wanted to help keeping it maintained (as it was almost died)

Based on the first comment of this issue seems this was happening also with older versions so I dunno, I'm sorry :(

@AndreMaz
Copy link

Hey @robertsLando no need to apologize. Huge kudos to you for what you're doing 💪

I'll have to check my previous releases and test them out.

Just checked, I went from 5.3.3 -> 5.5.2 -> 5.5.4

I'm rolling back to 5.3.3 and going to let it run for a while. Will keep you updated

@robertsLando
Copy link
Member

@AndreMaz Thanks!

@overflowz
Copy link

Hey @robertsLando no need to apologize. Huge kudos to you for what you're doing 💪

I'll have to check my previous releases and test them out.

Just checked, I went from 5.3.3 -> 5.5.2 -> 5.5.4

I'm rolling back to 5.3.3 and going to let it run for a while. Will keep you updated

if it helps, we were using 5.3.3 previously and the issue was appearing more frequently than now (about 3-4 times a week).

@robertsLando
Copy link
Member

Is there a way for you guys to override the log function in client constructor and store logs somewhere? It would be very helpful to see the difference in logs between a normal reconnect and one that causes the stop of publishing

@AndreMaz
Copy link

Yep, I can. I'm going to try to collect the logs

@robertsLando
Copy link
Member

Better if using latest version

@AndreMaz
Copy link

Yep yep. No problem

@overflowz
Copy link

sadly not for me :( we're using remote iot devices on 3g, which also are limited on network bandwidth. is there a way to turn on logging only for connect/reconnects (or anything related to connections only)?

@robertsLando
Copy link
Member

is there a way to turn on logging only for connect/reconnects (or anything related to connections only)?

Nope unfortunately

@AndreMaz
Copy link

AndreMaz commented Apr 29, 2024

Hey @robertsLando Is this enough info to locate the issue?

[2024-04-29T11:55:51.052Z] DEBUG MQTT-CLIENT: publish :: message `%s` to topic `%s` {"ts":1714391751051,"values":{"cpuCores":8}} v3/telemetry
[2024-04-29T11:55:51.052Z] DEBUG MQTT-CLIENT: publish :: qos 0
[2024-04-29T11:55:51.052Z] DEBUG MQTT-CLIENT: MqttClient:publish: packet cmd: %s publish
[2024-04-29T11:55:51.052Z] DEBUG MQTT-CLIENT: _sendPacket :: (%s) ::  start mqttjs_1d7221de
[2024-04-29T11:55:51.052Z] DEBUG MQTT-CLIENT: _writePacket :: packet: %O { cmd: 'publish', topic: 'v3/telemetry', payload: '{"ts":1714391751051,"values":{"cpuCores":8}}', qos: 0, retain: false, messageId: 0, dup: false }
[2024-04-29T11:55:51.053Z] DEBUG MQTT-CLIENT: _writePacket :: emitting `packetsend`
[2024-04-29T11:55:51.053Z] DEBUG MQTT-CLIENT: _writePacket :: writing to stream
[2024-04-29T11:55:51.053Z] DEBUG MQTT-CLIENT: _writePacket :: writeToStream result %s false
[2024-04-29T11:55:51.053Z] DEBUG MQTT-CLIENT: _writePacket :: handle events on `drain` once through callback.
[2024-04-29T11:55:51.053Z] DEBUG MQTT-CLIENT: _sendPacket :: (%s) ::  end mqttjs_1d7221de
[2024-04-29T11:55:51.053Z] DEBUG MQTT-CLIENT: publish :: message `%s` to topic `%s` {"cpuCores":8} v3/attributes
[2024-04-29T11:55:51.053Z] DEBUG MQTT-CLIENT: publish :: qos 0
[2024-04-29T11:55:51.054Z] DEBUG MQTT-CLIENT: MqttClient:publish: packet cmd: %s publish
[2024-04-29T11:55:51.054Z] DEBUG MQTT-CLIENT: _sendPacket :: (%s) ::  start mqttjs_1d7221de
[2024-04-29T11:55:51.054Z] DEBUG MQTT-CLIENT: _writePacket :: packet: %O { cmd: 'publish', topic: 'v3/attributes', payload: '{"cpuCores":8}', qos: 0, retain: false, messageId: 0, dup: false }
[2024-04-29T11:55:51.054Z] DEBUG MQTT-CLIENT: _writePacket :: emitting `packetsend`
[2024-04-29T11:55:51.054Z] DEBUG MQTT-CLIENT: _writePacket :: writing to stream
[2024-04-29T11:55:51.054Z] DEBUG MQTT-CLIENT: _writePacket :: writeToStream result %s false
[2024-04-29T11:55:51.054Z] DEBUG MQTT-CLIENT: _writePacket :: handle events on `drain` once through callback.
[2024-04-29T11:55:51.054Z] DEBUG MQTT-CLIENT: _sendPacket :: (%s) ::  end mqttjs_1d7221de
[2024-04-29T11:55:51.285Z] WARN  BROKER: Request 'mqtt.telemetry' timed out. { requestID: 'd0c3564e-d865-44f5-ad6e-efac7eecfd66', nodeID: 'misty-violet-6717', timeout: 10000 }
[2024-04-29T11:55:51.286Z] WARN  BROKER: Request 'mqtt.sendAttributes' timed out. { requestID: '25d2dd35-f844-4399-b87a-48d02e09e9de', nodeID: 'misty-violet-6717', timeout: 10000 }

The last 2 lines mean that my telemetry action

async function telemetry(topic, message) {
  if (!this.client.connected) {
    this.storage.store({ topic, message });
    return;
  }

  await this.client.publishAsync(topic, message);
}

did not resolve in 10sec

@robertsLando
Copy link
Member

@AndreMaz I don't see the reconnection in the logs, also I see the first packet is sent correctly and then the second one nope but nothing that tells the why

@mcollina Any clue what could cause this issue? It seems to happen only with TLS

@AndreMaz
Copy link

AndreMaz commented Apr 29, 2024

@robertsLando this is interesting. The logs that I've shared previously were from 11:55:51 but the re-connection logic only appears at 12:01:22

[2024-04-29T12:01:22.745Z] DEBUG MQTT-CLIENT: streamErrorHandler :: error read ENETUNREACH
[2024-04-29T12:01:22.745Z] DEBUG MQTT-CLIENT: streamErrorHandler :: emitting error
[2024-04-29T12:01:22.746Z] ERROR MQTT-SERVICE: [mqtt] Client error. Error: read ENETUNREACH
    at TLSWrap.onStreamRead (node:internal/stream_base_commons:217:20)
    at TLSWrap.callbackTrampoline (node:internal/async_hooks:130:17) {
  errno: -101,
  code: 'ENETUNREACH',
  syscall: 'read'
}
[2024-04-29T12:01:22.748Z] DEBUG MQTT-CLIENT: (%s)stream :: on close mqttjs_1d7221de
[2024-04-29T12:01:22.749Z] DEBUG MQTT-CLIENT: _flushVolatile :: deleting volatile messages from the queue and setting their callbacks as error function
[2024-04-29T12:01:22.749Z] DEBUG MQTT-CLIENT: stream: emit close to MqttClient
[2024-04-29T12:01:22.749Z] DEBUG MQTT-CLIENT: close :: connected set to `false`
[2024-04-29T12:01:22.749Z] DEBUG MQTT-CLIENT: close :: clearing connackTimer
[2024-04-29T12:01:22.749Z] DEBUG MQTT-CLIENT: close :: destroy ping timer
[2024-04-29T12:01:22.750Z] DEBUG MQTT-CLIENT: close :: calling _setupReconnect
[2024-04-29T12:01:22.750Z] DEBUG MQTT-CLIENT: _setupReconnect :: emit `offline` state
[2024-04-29T12:01:22.750Z] INFO  MQTT-SERVICE: MQTT client is offline.
[2024-04-29T12:01:22.750Z] DEBUG MQTT-CLIENT: _setupReconnect :: set `reconnecting` to `true`
[2024-04-29T12:01:22.750Z] DEBUG MQTT-CLIENT: _setupReconnect :: setting reconnectTimer for %d ms 1000
[2024-04-29T12:01:22.750Z] INFO  MQTT-SERVICE: Disconnected from MQTT endpoint.
[2024-04-29T12:01:23.377Z] WARN  BROKER: Request 'mqtt.sendAttributes' is timed out. { requestID: '852d8692-9108-4ff1-b417-1ff969f0f43c', nodeID: ', timeout: 10000 }

and then again at 12:16:01

[2024-04-29T12:16:01.374Z] DEBUG MQTT-CLIENT: _flushVolatile :: deleting volatile messages from the queue and setting their callbacks as error function
[2024-04-29T12:16:01.374Z] DEBUG MQTT-CLIENT: stream: emit close to MqttClient
[2024-04-29T12:16:01.374Z] DEBUG MQTT-CLIENT: close :: connected set to `false`
[2024-04-29T12:16:01.374Z] DEBUG MQTT-CLIENT: close :: clearing connackTimer
[2024-04-29T12:16:01.374Z] DEBUG MQTT-CLIENT: close :: destroy ping timer
[2024-04-29T12:16:01.374Z] DEBUG MQTT-CLIENT: close :: calling _setupReconnect
[2024-04-29T12:16:01.374Z] DEBUG MQTT-CLIENT: _setupReconnect :: emit `offline` state
[2024-04-29T12:16:01.374Z] INFO  THINGSBOARD: MQTT client is offline.
[2024-04-29T12:16:01.374Z] DEBUG MQTT-CLIENT: _setupReconnect :: set `reconnecting` to `true`
[2024-04-29T12:16:01.374Z] DEBUG MQTT-CLIENT: _setupReconnect :: setting reconnectTimer for %d ms 1000
[2024-04-29T12:16:01.374Z] INFO  THINGSBOARD: Disconnected from MQTT endpoint.
[2024-04-29T12:16:02.375Z] DEBUG MQTT-CLIENT: reconnectTimer :: reconnect triggered!
[2024-04-29T12:16:02.376Z] DEBUG MQTT-CLIENT: _reconnect: emitting reconnect to client
[2024-04-29T12:16:02.376Z] INFO  THINGSBOARD: Reconnecting to MQTT endpoint...
[2024-04-29T12:16:02.376Z] DEBUG MQTT-CLIENT: _reconnect: calling connect
[2024-04-29T12:16:02.378Z] DEBUG MQTT-CLIENT: connect :: calling method to clear reconnect
[2024-04-29T12:16:02.379Z] DEBUG MQTT-CLIENT: _clearReconnect : clearing reconnect timer
[2024-04-29T12:16:02.379Z] DEBUG MQTT-CLIENT: connect :: using streamBuilder provided to client to create stream
[2024-04-29T12:16:02.381Z] DEBUG MQTT-CLIENT: connect :: pipe stream to writable stream
[2024-04-29T12:16:02.381Z] DEBUG MQTT-CLIENT: connect: sending packet `connect`

@robertsLando
Copy link
Member

this is interesting. The logs that I've shared previously were from 11:55:51 but the re-connection logic only appears at 12:01:22

It could be that the connection was already broken but not closed yet so the write to stream happened but you didn't received any response back. Are you using a MQTT packet store? What is your keepalive in options?

@AndreMaz
Copy link

Are you using a MQTT packet store?

No

What is your keepalive in options?

Everything is default

Here's how I create my client:

this.client = mqtt.connect(url, {
  username: token,
  cert: fs.readFileSync(certPath, "utf8"),
  rejectUnauthorized: false,
});

@robertsLando
Copy link
Member

robertsLando commented Apr 29, 2024

@AndreMaz I don't see anything strange on your logs. The logic is correct:

  • The stream emits the error (it is just logged and emitted by client as well)
  • The stream emits the close. This causes the client to setup the reconnection
  • The reconnect timer is created and then it triggers a reconnect after 1 second

Then it would be interesting to see what happens next when packets are not sent anymore

@AndreMaz
Copy link

But then why at 11:55 it tried to send data, which means that connected=true, but then failed to resolve the publish promise?

@robertsLando
Copy link
Member

which means that connected=true, but then failed to resolve the publish promise?

I think because the connection was not broken yet? What I don't understand is that you have a keepalive of 60s but the disconnect happens much later 🤔 I think I just have discovered a possible bug here as I'm shifting the keepalive check every time we send a packet but we should do it every time we successfully write it to stream instead.

Dunno if this could solve the issue but it's worth a try

@robertsLando
Copy link
Member

Ok yes I think to have found the root cause of the problem guys 🎉 Could someone give a try to the PR #1855?

@overflowz
Copy link

If that's gonna fix it, I'm gonna cri

@robertsLando
Copy link
Member

Waiting for your feedback so! It works on my side and it also makes sense, I wasn't able to reproduce it on my side because I was publishing a message every time I received one, in your case you keep publish independently and when I did this I noticed the keepalive timeout was never happening and that's the error here, we was shifting the keepalive check on every message sent (so even on publish) causing it to never trigger even if the connection was broken

@AndreMaz
Copy link

@robertsLando thank you! Just deployed. Will leave it running for a couple of days.

I'll let you know the result

@robertsLando
Copy link
Member

I just pushed a little change to avoid possible race conditions, please redeploy with latest commit.

Now I will write some tests to better cover this things

@AndreMaz
Copy link

@robertsLando yep yep. Re-deploying it now

@AndreMaz
Copy link

AndreMaz commented May 2, 2024

FYI it's been 2 days and so far so good

@robertsLando
Copy link
Member

@AndreMaz Thanks for the feedback! I also made a release in the meanwhile so you can use that once you end the tests :) How often did it used to happen before?

@AndreMaz
Copy link

@robertsLando sorry for delay, I had to handle some other stuff.

Checked the logs and everything seems ok. However, I'm seeing the new v5.5.6 release and a keepalive PR which means that there are still some edge cases.

Should I stick with 5.5.5 or bump to the 5.5.6?

@robertsLando
Copy link
Member

robertsLando commented May 13, 2024

@AndreMaz The issue there is that the fix to this introduced a bug that happens when the client only receives packets (subscribes), in that case the keepalive ping is always skipped as in order to fix this I moved the ping reschedule on every packet received from server so the ping was never sent in that case causing the broker to force close the connection. 5.5.6 fixes this issue and #1865 will be hopefully the last keepalive related PR I will do, that refactors keepalive a lot in order to make it more respectful of specs :)

Anyway 5.5.6 is good as it will work for all cases, 5.6.0 will be the next release once #1865 lands

@AndreMaz
Copy link

Super! Thank you for your hard work

@robertsLando
Copy link
Member

Of course stronger tests have been added in order to cover such cases in future 🤞🏼

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants