ZeroMQ IPC fails after a while #607

Kile · 2024-01-09T15:31:16Z

This has been an issue for a while. In a development environment, ZeroMQ works perfectly fine. However, not long after a restart of the production code, ZeroMQ requests start failing silently. This breaks vote rewards as well as all GET endpoints used by the website, rendering it nearly useless. Several fixes have been attempted, but none have worked so far.

This issue occurs in these lines of code:
Server:

Killua/killua/cogs/api.py

Lines 27 to 57 in 7bf697e

    
    async def start(self):
        """Starts the zmq server asynchronously and handles incoming requests"""
        context = Context()
        auth = AsyncioAuthenticator(context)
        auth.start()
        auth.configure_plain(domain="*", passwords={"killua": IPC_TOKEN})
        auth.allow("127.0.0.1")
        socket = context.socket(ROUTER)
        socket.plain_server = True
        socket.bind("tcp://*:5555")
        poller = Poller()
        poller.register(socket, POLLIN)
        while True:
            socks = dict(await poller.poll())
            if socket in socks and socks[socket] == POLLIN:
                message = await socket.recv_multipart()
                try:
                    identity, _, request = message # Sometimes there may be an empty frame in the middle of the message
                except ValueError:
                    identity, request = message
                decoded = loads(request.decode())
                res = await getattr(self, decoded["route"])(decoded["data"])
                if res:
                    await socket.send_multipart([identity, dumps(res).encode()])
                else:
                    await socket.send_multipart([identity, b'{"status":"ok"}'])
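
The try/except unpacking in the server loop accepts two message shapes: identity plus payload, or identity, an empty delimiter frame, and payload. A minimal sketch of the same idea as a standalone helper follows. The name `split_frames` is hypothetical and not part of the repository, and checking that the middle frame is actually empty is an added assumption that the original code does not make.

```python
import json
from typing import List, Tuple


def split_frames(frames: List[bytes]) -> Tuple[bytes, dict]:
    """Return (identity, decoded request) from a ROUTER multipart message.

    Hypothetical helper: accepts [identity, payload] or
    [identity, b"", payload] and rejects anything else.
    """
    if len(frames) == 3:
        identity, delimiter, payload = frames
        # Assumption: a middle frame is only valid if it is the empty delimiter.
        if delimiter != b"":
            raise ValueError("expected an empty delimiter frame")
    elif len(frames) == 2:
        identity, payload = frames
    else:
        raise ValueError(f"unexpected frame count: {len(frames)}")
    return identity, json.loads(payload.decode())
```

Rejecting unexpected frame counts explicitly means a malformed message raises a descriptive error instead of falling through the `except ValueError` branch and failing a second time there.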

Client:

Killua/killua/webhook/api.py

Lines 30 to 50 in 7bf697e

    
    async def make_request(route: str, data: dict) -> dict:
        context = Context.instance()
        socket = context.socket(DEALER)
        socket.identity = uuid.uuid4().hex.encode('utf-8')
        socket.plain_username = b"killua"
        socket.plain_password = IPC_TOKEN.encode("UTF-8")
        socket.connect("tcp://localhost:5555")
        request = json.dumps({"route": route, "data": data}).encode('utf-8')
        socket.send(request)
        poller = Poller()
        poller.register(socket, POLLIN)
        while True:
            events = dict(await poller.poll())
            if socket in events and events[socket] == POLLIN:
                multipart = json.loads((await socket.recv_multipart())[0].decode())
                socket.close()
                context.term()
                return multipart
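
The client above awaits `poller.poll()` with no timeout, so a reply that never arrives shows up as a silent hang rather than an error. One way to make such a hang visible, sketched here with the standard library only, is to bound the wait with `asyncio.wait_for`. The names `call_with_timeout` and `never_replies` are hypothetical, and the error payload is an assumption, not something the repository defines.

```python
import asyncio


async def call_with_timeout(awaitable, seconds: float) -> dict:
    """Await `awaitable`, returning an error dict if it takes too long."""
    try:
        return await asyncio.wait_for(awaitable, timeout=seconds)
    except asyncio.TimeoutError:
        # Assumption: callers prefer an explicit error payload over a hang.
        return {"error": "timeout"}


async def never_replies() -> dict:
    """Stand-in for a request whose reply never arrives."""
    await asyncio.Event().wait()
    return {"status": "ok"}


# The stand-in request is cancelled after 50 ms instead of hanging forever.
result = asyncio.run(call_with_timeout(never_replies(), 0.05))
```

This only surfaces the problem; the cancelled request's socket would still need to be closed by the caller.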

I suspected this was caused by too many open connections, but I am not sure that is the case, since I seem to close all connections. This is the output of an lsof command when this issue occurred in production:

Because this has been a long-running issue and because it is quite important for the bot's functionality, I am turning this into an issue to keep track of the progress.

I have also asked this Stack Overflow question in hopes of a fix.

Kile · 2024-01-09T16:21:41Z

This seems to be an issue with the API, not ZeroMQ. I can still make ZeroMQ requests internally, but the API fails. I remember it occasionally failing after a while even before I created the website; with the large number of additional requests, it seems to happen much faster. I am just not sure why. I will continue investigating.

Kile · 2024-01-13T11:07:07Z

I changed Hypercorn to use 8 workers instead of 1 a few days ago, and this seems to have helped. The API has been running without issue for multiple days now.

Kile · 2024-01-31T13:51:34Z

Sadly, this issue is not resolved. It is definitely a Hypercorn issue: increasing the number of workers only delays the point at which the API starts timing out. I am looking into solutions.

Kile · 2024-05-27T01:10:28Z

This may now be resolved. While rewriting this API in Rust, I believe I found the root cause of this issue with the help of @y21.

The root cause was that ZeroMQ, in its default behaviour, prevents the socket and context from being dropped at the end of a function while messages are still pending (the default linger period is infinite). So when my make_request function ends and everything up until that point has worked as expected, it tries to drop the variables but is blocked indefinitely.

This means no error is raised, but the code freezes at a low level, which is insanely hard to trace.

It turns out this is default zmq behaviour, but thankfully there is a method to change it. So a simple one-line change fixes this:

    socket.set_linger(0)

That's it. That is what I have been trying to find for 8 months. Hopefully this actually fixes it. I will keep this issue open for a bit; if I close it, that was it.

Kile · 2024-05-27T13:39:28Z

Looking through the Python implementation, this is a bit harder to see because the linger argument is passed to the underlying C implementation.
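
For comparison, here is a minimal sketch of the same linger setting in Python, assuming pyzmq is installed. The function name `send_and_teardown` and the endpoint are hypothetical; the endpoint is assumed to have nothing listening, so the message stays queued. With a linger of 0, `close()` discards the unsent message and `term()` returns promptly instead of waiting for delivery.

```python
import time

import zmq


def send_and_teardown(endpoint: str, linger_ms: int) -> float:
    """Queue one message, then time how long socket and context teardown takes."""
    ctx = zmq.Context()
    sock = ctx.socket(zmq.DEALER)
    # Linger period in milliseconds: 0 means discard pending messages on close.
    sock.setsockopt(zmq.LINGER, linger_ms)
    sock.connect(endpoint)
    sock.send(b"ping")  # queued locally; assumed that no peer is listening
    start = time.monotonic()
    sock.close()
    ctx.term()  # with an infinite linger this could block on the queued message
    return time.monotonic() - start


# Hypothetical endpoint, chosen only for illustration.
elapsed = send_and_teardown("tcp://127.0.0.1:5599", linger_ms=0)
```

The trade-off is that a linger of 0 drops anything not yet sent, so it suits request/response code like this, where a lost request can simply be retried.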

Kile added the bug label Jan 9, 2024

Kile added this to the Version 1.0 milestone Jan 9, 2024

Kile self-assigned this Jan 9, 2024

Kile linked a pull request May 26, 2024 that will close this issue

Api rewrite to rust #624

Open

Kile linked a pull request May 27, 2024 that will close this issue

Api rewrite to rust #624

Open
