Killing a service from AWS fails with RPC error #1116

Open
andreyv opened this issue Feb 24, 2019 · 2 comments

@andreyv
Member

andreyv commented Feb 24, 2019

Steps to reproduce:

  1. Open the administrator interface and navigate to Resource Usage -> All.
  2. Try to kill a Worker.

Actual result in AWS logs:

2019-02-24 17:34:29,727 - ERROR [Admin,0 20 rpc::process_incoming_response] ResourceService,0 signaled RPC for method kill_service was unsuccessful: RPCError: Write failed.
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/cms-1.5.dev0-py3.6.egg/cms/io/rpc.py", line 402, in process_incoming_request
    response["__data"] = method(**request["__data"])
  File "/usr/local/lib/python3.6/dist-packages/cms-1.5.dev0-py3.6.egg/cms/service/ResourceService.py", line 444, in kill_service
    return result.get()
  File "/usr/lib/python3/dist-packages/gevent/event.py", line 375, in get
    return self._raise_exception()
  File "/usr/lib/python3/dist-packages/gevent/event.py", line 355, in _raise_exception
    reraise(*self.exc_info)
  File "/usr/lib/python3/dist-packages/gevent/_compat.py", line 34, in reraise
    raise value
cms.io.rpc.RPCError: Write failed.
.

Meanwhile, ResourceService reports that everything is fine:

2019-02-24 17:34:29,724 - INFO [Resource,0 7 ResourceService::kill_service] Killing Worker,0 as asked.
2019-02-24 17:34:29,727 - INFO [Resource,0 8 rpc::initialize] Established connection with localhost:26000 (Worker,0) (local address: 127.0.0.1:58716).
@andreyv
Member Author

andreyv commented Aug 19, 2019

The underlying error is OSError("Not connected."), raised in cms/io/rpc.py:

cms/cms/io/rpc.py

Lines 263 to 264 in d4c9e92

    if not self.connected:
        raise OSError("Not connected.")

In kill_service(), service.connected is still False at the moment the RPC call is executed here:

remote_service = self.connect_to(ServiceCoord(name, shard))
result = remote_service.quit(reason="Asked by ResourceService")

I believe this is a race condition between the above code and the connection loop that is started in

cms/cms/io/rpc.py

Lines 503 to 509 in d4c9e92

    def connect(self):
        """Connect and start the main loop.
        """
        if self._loop is not None and not self._loop.ready():
            raise RuntimeError("Already (auto-re)connecting")
        self._loop = gevent.spawn(self._run)

service.connect() just starts the loop and returns without waiting until the connection is established.

Inserting time.sleep(1) before remote_service.quit() makes the call succeed. (EDIT: Should've used gevent.sleep() instead, that works too.)

Perhaps the connect() function should be made synchronous, or RPC calls should check and wait for a "connected" semaphore before actually sending data.
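
To illustrate the second option, here is a minimal standalone sketch of the race and of a "wait until connected" guard. It is not CMS code: ToyRemoteService, wait_connected(), kill_racy() and kill_waiting() are hypothetical names, and the standard library's threading module stands in for gevent so the example runs on its own (gevent.event.Event offers the same set()/wait() calls). The connection delay is simulated with a sleep.

```python
import threading
import time


class ToyRemoteService:
    """Hypothetical stand-in for a remote service client; not CMS code."""

    def __init__(self, connect_delay=0.2):
        self.connect_delay = connect_delay
        self.connected = False
        # Set once the background loop has established the connection.
        self._connected_event = threading.Event()
        self._loop = None

    def _run(self):
        # Simulate the time needed to establish the connection.
        time.sleep(self.connect_delay)
        self.connected = True
        self._connected_event.set()

    def connect(self):
        # Like the connect() quoted above: start the loop and return at once.
        self._loop = threading.Thread(target=self._run, daemon=True)
        self._loop.start()

    def wait_connected(self, timeout=None):
        # Block until connected; returns False if the timeout expires first.
        return self._connected_event.wait(timeout)

    def quit(self, reason=""):
        # Same guard as in the rpc.py excerpt above.
        if not self.connected:
            raise OSError("Not connected.")
        return True


def kill_racy(service):
    # Mirrors the failing pattern: call right after connect() returns.
    service.connect()
    return service.quit(reason="Asked by ResourceService")


def kill_waiting(service, timeout=5.0):
    # Wait for the "connected" signal before sending anything.
    service.connect()
    if not service.wait_connected(timeout):
        raise OSError("Not connected.")
    return service.quit(reason="Asked by ResourceService")
```

In this toy model kill_racy() raises OSError because quit() runs before the background loop has flipped the connected flag, while kill_waiting() succeeds. Unlike a fixed sleep, the wait returns as soon as the connection is up and fails explicitly if the timeout expires.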

andreyv added a commit to lio-lv/cms that referenced this issue Aug 19, 2019
@andreyv
Member Author

andreyv commented Aug 28, 2019

Looks like this is a regression from f954c2f.

wil93 added this to the P1 milestone Nov 27, 2022