Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Watchdog failures #171

Open
puddly opened this issue Dec 9, 2023 · 9 comments
Open

Watchdog failures #171

puddly opened this issue Dec 9, 2023 · 9 comments

Comments

@puddly
Copy link
Contributor

puddly commented Dec 9, 2023

#170 added a watchdog command (VR) that is sent every 30s to ensure the radio is still alive. It seems that the XBee serial protocol can't handle this and the command times out, causing a restart. If I add an asyncio.Lock around _at_partial it doesn't seem to help either.

@Shulyaka Are you familiar with the serial protocol? Do you happen to know why this would be the case?

Below is my patchset to enable a send lock:

diff --git a/zigpy_xbee/api.py b/zigpy_xbee/api.py
index b4a73da..51ba61f 100644
--- a/zigpy_xbee/api.py
+++ b/zigpy_xbee/api.py
@@ -289,6 +289,7 @@ class XBee:
         self._cmd_mode_future: Optional[asyncio.Future] = None
         self._reset: asyncio.Event = asyncio.Event()
         self._running: asyncio.Event = asyncio.Event()
+        self._send_lock = asyncio.Lock()
 
     @property
     def reset_event(self):
@@ -353,12 +354,13 @@ class XBee:
         LOGGER.debug("Remote AT command: %s %s", name, args)
         data = t.serialize(args, (AT_COMMANDS[name],))
         try:
-            return await asyncio.wait_for(
-                self._command(
-                    "remote_at", ieee, nwk, options, name.encode("ascii"), data
-                ),
-                timeout=REMOTE_AT_COMMAND_TIMEOUT,
-            )
+            async with self._send_lock:
+                return await asyncio.wait_for(
+                    self._command(
+                        "remote_at", ieee, nwk, options, name.encode("ascii"), data
+                    ),
+                    timeout=REMOTE_AT_COMMAND_TIMEOUT,
+                )
         except asyncio.TimeoutError:
             LOGGER.warning("No response to %s command", name)
             raise
@@ -367,10 +369,11 @@ class XBee:
         LOGGER.debug("%s command: %s %s", cmd_type, name, args)
         data = t.serialize(args, (AT_COMMANDS[name],))
         try:
-            return await asyncio.wait_for(
-                self._command(cmd_type, name.encode("ascii"), data),
-                timeout=AT_COMMAND_TIMEOUT,
-            )
+            async with self._send_lock:
+                return await asyncio.wait_for(
+                    self._command(cmd_type, name.encode("ascii"), data),
+                    timeout=AT_COMMAND_TIMEOUT,
+                )
         except asyncio.TimeoutError:
             LOGGER.warning("%s: No response to %s command", cmd_type, name)
             raise
@@ -597,9 +600,3 @@ class XBee:
                 raise APIException("Failed to configure XBee for API mode")
         finally:
             self.close()
-
-    def __getattr__(self, item):
-        """Handle supported command requests."""
-        if item in COMMAND_REQUESTS:
-            return functools.partial(self._command, item)
-        raise AttributeError(f"Unknown command {item}")
diff --git a/zigpy_xbee/zigbee/application.py b/zigpy_xbee/zigbee/application.py
index 2158b95..4ad8e62 100644
--- a/zigpy_xbee/zigbee/application.py
+++ b/zigpy_xbee/zigbee/application.py
@@ -302,7 +302,8 @@ class ControllerApplication(zigpy.application.ControllerApplication):
                 "Cannot send a packet to a device without a known IEEE address"
             )
 
-        send_req = self._api.tx_explicit(
+        send_req = self._api._command(
+            "tx_explicit",
             long_addr,
             short_addr,
             packet.src_ep or 0,
@@ -356,7 +357,7 @@ class ControllerApplication(zigpy.application.ControllerApplication):
         # Key type:
         # 0 = Pre-configured Link Key (KY command of the joining device)
         # 1 = Install Code With CRC (I? command of the joining device)
-        await self._api.register_joining_device(node, reserved, key_type, link_key)
+        await self._api._command("register_joining_device", node, reserved, key_type, link_key)
 
     def handle_modem_status(self, status):
         """Handle changed Modem Status of the device."""
@Shulyaka
Copy link
Contributor

Shulyaka commented Dec 9, 2023

I am already looking into this issue, don't know the root cause yet.

@Shulyaka
Copy link
Contributor

Shulyaka commented Dec 9, 2023

From my logs I can see that the protocol is correct, but the response is significantly delayed:

2023-12-09 14:38:49.048 DEBUG (MainThread) [zigpy_xbee.api] at command: VR ()
2023-12-09 14:38:49.050 DEBUG (MainThread) [zigpy_xbee.api] Command at (b'VR', b'')
2023-12-09 14:38:49.051 DEBUG (MainThread) [zigpy_xbee.uart] Sending: b'\x08\x06VR'
2023-12-09 14:38:49.063 DEBUG (MainThread) [zigpy_xbee.uart] Frame received: b'\x88\x06VR\x00\x10\x12'
2023-12-09 14:38:49.063 DEBUG (MainThread) [zigpy_xbee.api] Frame received: at_response
...
2023-12-09 14:38:54.103 DEBUG (MainThread) [zigpy_xbee.api] at command: VR ()
2023-12-09 14:38:54.103 DEBUG (MainThread) [zigpy_xbee.api] Command at (b'VR', b'')
2023-12-09 14:38:54.103 DEBUG (MainThread) [zigpy_xbee.uart] Sending: b'\x08*VR'
...
2023-12-09 14:39:04.105 WARNING (MainThread) [zigpy_xbee.api] at: No response to VR command
2023-12-09 14:39:04.105 WARNING (MainThread) [zigpy.backups] Failed to create a network backup
Traceback (most recent call last):
  File "/usr/lib/python3.11/asyncio/tasks.py", line 500, in wait_for
    return fut.result()
           ^^^^^^^^^^^^
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/srv/homeassistant/lib/python3.11/site-packages/zigpy/backups.py", line 220, in _backup_loop
    await self.create_backup()
  File "/srv/homeassistant/lib/python3.11/site-packages/zigpy/backups.py", line 143, in create_backup
    await self.app.load_network_info(load_devices=load_devices)
  File "/srv/homeassistant/lib/python3.11/site-packages/zigpy_xbee/zigbee/application.py", line 144, in load_network_info
    version = await self._api._at_command("VR")
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/srv/homeassistant/lib/python3.11/site-packages/zigpy_xbee/api.py", line 370, in _at_partial
    return await asyncio.wait_for(
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/asyncio/tasks.py", line 502, in wait_for
    raise exceptions.TimeoutError() from exc
TimeoutError
...
2023-12-09 14:39:23.820 DEBUG (MainThread) [zigpy_xbee.api] at command: VR ()
2023-12-09 14:39:23.820 DEBUG (MainThread) [zigpy_xbee.api] Command at (b'VR', b'')
2023-12-09 14:39:23.820 DEBUG (MainThread) [zigpy_xbee.uart] Sending: b'\x088VR'
...
2023-12-09 14:39:33.825 WARNING (MainThread) [zigpy_xbee.api] at: No response to VR command
2023-12-09 14:39:33.825 WARNING (MainThread) [zigpy.application] Watchdog failure
Traceback (most recent call last):
  File "/usr/lib/python3.11/asyncio/tasks.py", line 500, in wait_for
    return fut.result()
           ^^^^^^^^^^^^
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/srv/homeassistant/lib/python3.11/site-packages/zigpy/application.py", line 656, in _watchdog_loop
    await self._watchdog_feed()
  File "/srv/homeassistant/lib/python3.11/site-packages/zigpy_xbee/zigbee/application.py", line 55, in _watchdog_feed
    await self._api._at_command("VR")
  File "/srv/homeassistant/lib/python3.11/site-packages/zigpy_xbee/api.py", line 370, in _at_partial
    return await asyncio.wait_for(
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/asyncio/tasks.py", line 502, in wait_for
    raise exceptions.TimeoutError() from exc
TimeoutError
2023-12-09 14:39:33.840 DEBUG (MainThread) [zigpy.application] Connection to the radio has been lost: TimeoutError()
2023-12-09 14:39:33.841 DEBUG (MainThread) [homeassistant.components.zha.core.gateway] Connection to the radio was lost: TimeoutError()
2023-12-09 14:39:33.842 DEBUG (MainThread) [zigpy.application] Stopping watchdog loop
...
2023-12-09 14:39:45.829 DEBUG (MainThread) [zigpy_xbee.uart] Frame received: b'\x88*VR\x00\x10\x12'
2023-12-09 14:39:45.829 DEBUG (MainThread) [zigpy_xbee.api] Frame received: at_response
2023-12-09 14:39:45.829 ERROR (MainThread) [homeassistant] Error doing job: Exception in callback SerialTransport._read_ready()
Traceback (most recent call last):
  File "/usr/lib/python3.11/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/srv/homeassistant/lib/python3.11/site-packages/serial_asyncio_fast/__init__.py", line 137, in _read_ready
    self._protocol.data_received(data)
  File "/srv/homeassistant/lib/python3.11/site-packages/zigpy_xbee/uart.py", line 95, in data_received
    self.frame_received(frame)
  File "/srv/homeassistant/lib/python3.11/site-packages/zigpy_xbee/uart.py", line 103, in frame_received
    self._api.frame_received(frame)
  File "/srv/homeassistant/lib/python3.11/site-packages/zigpy_xbee/api.py", line 392, in frame_received
    getattr(self, f"_handle_{command}")(*data)
  File "/srv/homeassistant/lib/python3.11/site-packages/zigpy_xbee/api.py", line 414, in _handle_at_response
    fut.set_result(response)
asyncio.exceptions.InvalidStateError: invalid state
...
2023-12-09 14:40:00.257 DEBUG (MainThread) [zigpy_xbee.uart] Frame received: b'\x888VR\x00\x10\x12'
2023-12-09 14:40:00.258 DEBUG (MainThread) [zigpy_xbee.api] Frame received: at_response
2023-12-09 14:40:00.258 ERROR (MainThread) [homeassistant] Error doing job: Exception in callback SerialTransport._read_ready()
Traceback (most recent call last):
  File "/usr/lib/python3.11/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/srv/homeassistant/lib/python3.11/site-packages/serial_asyncio_fast/__init__.py", line 137, in _read_ready
    self._protocol.data_received(data)
  File "/srv/homeassistant/lib/python3.11/site-packages/zigpy_xbee/uart.py", line 95, in data_received
    self.frame_received(frame)
  File "/srv/homeassistant/lib/python3.11/site-packages/zigpy_xbee/uart.py", line 103, in frame_received
    self._api.frame_received(frame)
  File "/srv/homeassistant/lib/python3.11/site-packages/zigpy_xbee/api.py", line 392, in frame_received
    getattr(self, f"_handle_{command}")(*data)
  File "/srv/homeassistant/lib/python3.11/site-packages/zigpy_xbee/api.py", line 414, in _handle_at_response
    fut.set_result(response)
asyncio.exceptions.InvalidStateError: invalid state

14:38:49.051 - Sending VR on init
14:38:49.063 - Response to VR (12 ms)
14:38:54.103 - Sending VR on watchdog
14:39:23.820 - Sending VR again
14:39:45.829 - Response to the second VR (51.726 s, after timeout)
14:40:00.257 - Response to the third VR (36.437 s, after timeout)

It is interesting that I don't have this issue on my testing environment, only on production.

@Shulyaka
Copy link
Contributor

Shulyaka commented Dec 9, 2023

I have no knowledge of the module internals, but my best theory is that it puts all requests in a queue. On startup, we are doing a lot of attribute reads, which takes time (especially if a remote device is off or has poor connection quality), so the response even for a local command gets delayed. The lock didn't help because it didn't cover all possible frames, only commands. I think that the module is capable of processing several requests in parallel, but when their number gets high enough we have a queue saturation condition.

So I would suggest two things to address it:

  1. Increase AT_COMMAND_TIMEOUT and REMOTE_AT_COMMAND_TIMEOUT to 60 or even more (it would depend on the number of devices in the network for which we perform attribute read on startup - usually the mains powered ones; 60 works in my case). May be even dynamically calculate the timeout (e.g. 5 * (1 + len(self._awaiting))).
  2. Don't do the watchdog before zha fully initializes (not sure if we can know the zha component state inside zigpy)

@puddly
Copy link
Contributor Author

puddly commented Dec 10, 2023

@Shulyaka If you have time to test, see if zigpy==0.60.1 (https://github.com/zigpy/zigpy/releases/tag/0.60.1) fixes things for you. I think pausing the watchdog during high-load periods will be enough of a stopgap to fix the current issue.

@Shulyaka
Copy link
Contributor

Will test today.

@Shulyaka
Copy link
Contributor

Nope, it does not solve the issue for XBee.
It's not a concurrency problem, the device receives the request. but it takes it too much time to respond.
Can we try a different strategy for watchdog, like only send probes if nothing was received over serial port for a certain period? I.e. if any traffic goes through, consider the connection as active. If nothing was received for more than a minute, then send the VR command.

@puddly
Copy link
Contributor Author

puddly commented Dec 10, 2023

This seems like a bug with the radio library. Perhaps there should be a global concurrency limit for requests? It should be possible to multiplex Zigbee requests along with normal radio traffic without completely locking up the XBee like this for 60+ seconds.

@Shulyaka
Copy link
Contributor

The specs don't mention such limit, at least I could not find it.

@puddly
Copy link
Contributor Author

puddly commented Dec 11, 2023

I've merged the PR to disable the watchdog for now as we can always re-enable it later.

I think #173 may be the proper fix for this problem but I don't have a real XBee network to test it with. If you have time to try it out some time in the future, let me know how it goes (and what concurrency limit allows you to control a lot of devices at once without any retrying from ZHA).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants