Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve robustness for multiple client connections #803

Merged
merged 6 commits into from
Oct 17, 2022

Conversation

achim-k
Copy link
Contributor

@achim-k achim-k commented Oct 7, 2022

Public API Changes
None

Description
Supersedes #741
Fixes #783
Fixes #758

This PR improves multi-client connection robustness by

  • avoiding unregistering of subscribers if one client disconnects while all others are still waiting for the initial callback
  • avoiding a deadlock that could occur when new clients connect, subscribe to a topic and then immediately disconnect again

The deadlock could be easily reproduced by rapidly connecting and disconnecting clients while one client was connected and receiving data.

# Create a client that stays connected
websocat ws://localhost:9090 --no--close < subscribe.json 

# Spawn a couple of clients that only sent the subscription operations and then disconnect
for i in {1..5}; do websocat ws://localhost:9090 < subscribe.json; done

subscribe.json (example):

{ "op": "subscribe", "topic": "/ns_1/camera/color/image_raw", "type": "sensor_msgs/msg/Image" }
{ "op": "subscribe", "topic": "/ns_1/camera/depth/camera_info", "type": "sensor_msgs/msg/CameraInfo" }

@achim-k achim-k changed the title Achim/multiple clients robustness Improve robustness for multiple client connections Oct 7, 2022
@achim-k achim-k requested a review from jtbandes October 7, 2022 14:19
@achim-k
Copy link
Contributor Author

achim-k commented Oct 7, 2022

@zflat / @MrBlenny would be great if you could take a look

@zflat
Copy link
Contributor

zflat commented Oct 10, 2022

These changes look good to me. I did a quick test locally to confirm that these changes address the issues seen in #783, at least for my setup. I still get these errors intermittently but the rosbridge server continues working and accepting new connections.

[rosbridge_websocket-2] ERROR:tornado.application:Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOMainLoop object at 0x7fdabb9b4970>>, )
[rosbridge_websocket-2] Traceback (most recent call last):
[rosbridge_websocket-2] File "/usr/lib/python3/dist-packages/tornado/ioloop.py", line 741, in _run_callback
[rosbridge_websocket-2] ret = callback()
[rosbridge_websocket-2] File "/usr/lib/python3/dist-packages/tornado/ioloop.py", line 765, in _discard_future_result
[rosbridge_websocket-2] future.result()
[rosbridge_websocket-2] File "/usr/lib/python3/dist-packages/tornado/gen.py", line 234, in wrapper
[rosbridge_websocket-2] yielded = ctx_run(next, result)
[rosbridge_websocket-2] File "/opt/rosbridge_ws/install/rosbridge_server/local/lib/python3.10/dist-packages/rosbridge_server/websocket_handler.py", line 197, in prewrite_message
[rosbridge_websocket-2] future = self.write_message(message, binary)
[rosbridge_websocket-2] File "/usr/lib/python3/dist-packages/tornado/websocket.py", line 337, in write_message
[rosbridge_websocket-2] raise WebSocketClosedError()
[rosbridge_websocket-2] tornado.websocket.WebSocketClosedError

@achim-k
Copy link
Contributor Author

achim-k commented Oct 10, 2022

I think we can simply remove the raise in the snippet below and simply log a warning. That should be sufficient.

cls.node_handle.get_logger().warn(
"WebSocketClosedError: Tried to write to a closed websocket",
throttle_duration_sec=1.0,
)
raise

@MrBlenny
Copy link

Amazing! From my testing, both the hanging and the error spam are solved!

I also intermittently got the error:

[rosbridge_websocket-3] ERROR:tornado.application:Exception in callback <function main.<locals>.<lambda> at 0x7ffa191bbf70>
[rosbridge_websocket-3] Traceback (most recent call last):
[rosbridge_websocket-3]   File "/usr/lib/python3/dist-packages/tornado/ioloop.py", line 1229, in _run
[rosbridge_websocket-3]     return self.callback()
[rosbridge_websocket-3]   File "/opt/greenroom/whiskey_ros_gs_external/lib/rosbridge_server/rosbridge_websocket", line 330, in <lambda>
[rosbridge_websocket-3]     spin_callback = PeriodicCallback(lambda: rclpy.spin_once(node, timeout_sec=0.01), 1)
[rosbridge_websocket-3]   File "/opt/ros/galactic/lib/python3.8/site-packages/rclpy/__init__.py", line 176, in spin_once
[rosbridge_websocket-3]     executor.spin_once(timeout_sec=timeout_sec)
[rosbridge_websocket-3]   File "/opt/ros/galactic/lib/python3.8/site-packages/rclpy/executors.py", line 712, in spin_once
[rosbridge_websocket-3]     raise handler.exception()
[rosbridge_websocket-3]   File "/opt/ros/galactic/lib/python3.8/site-packages/rclpy/task.py", line 239, in __call__
[rosbridge_websocket-3]     self._handler.send(None)
[rosbridge_websocket-3]   File "/opt/ros/galactic/lib/python3.8/site-packages/rclpy/executors.py", line 418, in handler
[rosbridge_websocket-3]     await call_coroutine(entity, arg)
[rosbridge_websocket-3]   File "/opt/ros/galactic/lib/python3.8/site-packages/rclpy/executors.py", line 343, in _execute_subscription
[rosbridge_websocket-3]     await await_or_execute(sub.callback, msg)
[rosbridge_websocket-3]   File "/opt/ros/galactic/lib/python3.8/site-packages/rclpy/executors.py", line 107, in await_or_execute
[rosbridge_websocket-3]     return callback(*args)
[rosbridge_websocket-3]   File "/opt/greenroom/whiskey_ros_gs_external/lib/python3.8/site-packages/rosbridge_library/internal/subscribers.py", line 218, in callback
[rosbridge_websocket-3]     for callback in callbacks:
[rosbridge_websocket-3] RuntimeError: dictionary changed size during iteration
[rosbridge_websocket-3] [WARN] [1665466611.521004707] [rosbridge_websocket]: WebSocketClosedError: Tried to write to a closed websocket
[rosbridge_websocket-3] ERROR:tornado.application:Exception in callback functools.partial(<function wrap.<locals>.null_wrapper at 0x7ffa182e38b0>, <Future finished exception=WebSocketClosedError()>)
[rosbridge_websocket-3] Traceback (most recent call last):
[rosbridge_websocket-3]   File "/usr/lib/python3/dist-packages/tornado/ioloop.py", line 758, in _run_callback
[rosbridge_websocket-3]     ret = callback()
[rosbridge_websocket-3]   File "/usr/lib/python3/dist-packages/tornado/stack_context.py", line 300, in null_wrapper
[rosbridge_websocket-3]     return fn(*args, **kwargs)
[rosbridge_websocket-3]   File "/usr/lib/python3/dist-packages/tornado/ioloop.py", line 779, in _discard_future_result
[rosbridge_websocket-3]     future.result()
[rosbridge_websocket-3]   File "/usr/lib/python3/dist-packages/tornado/gen.py", line 326, in wrapper
[rosbridge_websocket-3]     yielded = next(result)
[rosbridge_websocket-3]   File "/opt/greenroom/whiskey_ros_gs_external/lib/python3.8/site-packages/rosbridge_server/websocket_handler.py", line 197, in prewrite_message
[rosbridge_websocket-3]     future = self.write_message(message, binary)
[rosbridge_websocket-3]   File "/usr/lib/python3/dist-packages/tornado/websocket.py", line 259, in write_message
[rosbridge_websocket-3]     raise WebSocketClosedError()
[rosbridge_websocket-3] tornado.websocket.WebSocketClosedError

But, this just happens once when the client disconnects. I really appreciate your and @p0rys work on this - has helped me a heap at just the time I needed it :)

@achim-k
Copy link
Contributor Author

achim-k commented Oct 11, 2022

[rosbridge_websocket-3]   File "/opt/greenroom/whiskey_ros_gs_external/lib/python3.8/site-packages/rosbridge_library/internal/subscribers.py", line 218, in callback
[rosbridge_websocket-3]     for callback in callbacks:
[rosbridge_websocket-3] RuntimeError: dictionary changed size during iteration

This exception is alarming, but I found the reason for it: In

if not callbacks:
with self.rlock:
callbacks = self.subscriptions.values()
we only create a reference to the callbacks, not a deep copy. The best would probably be to protect the whole function with the reentrant mutex.

@zflat
Copy link
Contributor

zflat commented Oct 11, 2022

Small nitpick with 95905f8
Possibly put a comment where the raise used to be explaining that re-raising is intentionally not happening because sudden client disconnects (as happens with rapid page refreshes) are unavoidable. If warnings are frequent then there may be underlying problems, but occasional warnings are expected. At least, that is my interpretation to having the warning but not re-raising the exception.

The reason I think the comment would help is because other except blocks do re-raise and it is confusing looking at the code seeing one place that does not do this.

Co-authored-by: Jacob Bandes-Storch <jacob@foxglove.dev>
@ramlab-jose
Copy link

Sorry for commenting on a long-ago closed PR but I have a hunch that the issue I opened a few days ago might explain why the exceptions keep being thrown in the background and why they keep being thrown in the first place. If anyone is still interested in this feel free to chime in.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
6 participants