Race condition with nng_pipe_cb and live_sockets #40

Open
chaoflow opened this issue May 29, 2019 · 19 comments
Labels
bug Something isn't working

Comments

@chaoflow

With pynng 0.4.0 and Python 3.7.1 there seems to be a race condition involving _nng_pipe_cb.

Traceback (most recent call last):
  File "/venv/lib/python3.7/site-packages/pynng/nng.py", line 1211, in _nng_pipe_cb
    sock = _live_sockets[sock_id]
  File "/usr/lib/python3.7/weakref.py", line 137, in __getitem__
    o = self.data[key]()
KeyError: 139787429631984

@codypiersall Do you have any suggestions as to what could lead to a socket being expected in _live_sockets but not being there (anymore, or yet), and how to debug this?

So far I'm failing to produce a minimal working example. The code involves async generators with Pair0 sockets used as context managers, and the race condition is triggered independently of whether arecv/asend or recv/send are used.

Moving https://github.com/codypiersall/pynng/blob/master/pynng/nng.py#L308 to https://github.com/codypiersall/pynng/blob/master/pynng/nng.py#L295 seems to reduce the occurrence but does not eliminate the race condition.

@codypiersall
Owner

Thanks for opening the issue, @chaoflow.

You're 100% right about moving the addition to _live_sockets above the listen and dial calls. That's absolutely a race. If anyone else is following along: it's a race because the socket can start listening, and a remote can connect to it, before its id is added to the _live_sockets dict, which means that when the callback fires the socket won't be found in _live_sockets.
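Roughly, the fix amounts to a reordering in the constructor; a heavily simplified sketch (the helper names below are illustrative, not the actual pynng internals):

# Heavily simplified sketch of the Socket constructor ordering; _open_nng_socket
# and _socket_id are hypothetical stand-ins for the real pynng internals.
def __init__(self, *, listen=None, dial=None):
    self._socket = _open_nng_socket()
    # Register the socket *before* listening/dialing, so the pipe callback can
    # always look it up, no matter how quickly a peer connects.
    _live_sockets[_socket_id(self._socket)] = self
    if listen is not None:
        self.listen(listen)
    if dial is not None:
        self.dial(dial)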

There's a separate race inherent in my design here too: when a socket is garbage collected on the Python end, it is possible that some of the asynchronous work in the underlying nng lib is still running, but the socket is no longer in the _live_sockets dict. Nothing currently prevents that.

So there are (at least) two races: the one that you realized could be fixed by moving the line you moved, and one that can happen whenever a socket is gc'd.

After you made the fix you identified, do you think you're hitting the gc race, or do you think it may be a separate one?

@codypiersall
Owner

If anyone else runs into this but doesn't want to modify the source: a workaround for the first race is to not pass the listen or dial args to the socket constructor, but to call listen and dial on separate lines.
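For example (the address here is just an arbitrary example):

import pynng

addr = 'tcp://127.0.0.1:13131'

# Construct first, then listen/dial on separate lines, so the socket is fully
# registered internally before any remote pipe can connect.
with pynng.Pair0() as s0, pynng.Pair0() as s1:
    s0.listen(addr)
    s1.dial(addr)
    s0.send(b'ping')
    print(s1.recv())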

@codypiersall codypiersall added the bug Something isn't working label May 29, 2019
@chaoflow
Author

@codypiersall Thank you for the very fast response and the workaround, which works. The remaining races occur around context exits. I think it is the gc race.

@chaoflow
Author

To be more precise, we exit the socket's context manager by also ending the async generator, i.e. the socket goes out of scope (and is garbage collected) right after it is closed.

codypiersall added a commit that referenced this issue May 30, 2019
There was a race where the socket connect callbacks could get called
before the socket was added to the global cache.  Then, in the callback,
the socket would try to be retrieved from the cache, but it wouldn't be
there, resulting in an error.

This was a pretty bad race, and it affected real code. This was brought
up on #40 by GitHub user
chaoflow.  Many thanks to him for identifying the fix!
@codypiersall
Owner

codypiersall commented May 30, 2019

The race condition in Socket.__init__ is fixed in master (as of 0ea1c43). Still left is the race around socket removal; it's a little thornier.

I've also verified that it is a problem with the sockets getting GC'd (and thus removed from the _live_sockets WeakValueDictionary) by changing _live_sockets to a plain old dict. This isn't an acceptable solution by itself, though, because then we'd be leaking a lot of memory--every socket would be leaked! But I do think it is a necessary part of a solution, along with some additional bookkeeping to make sure that all outstanding connect/disconnect callbacks have been called.

Something like the following might work:

  1. Change _live_sockets to a dict keyed by socket id, with a tuple (num_connected_pipes, pynng.Socket) as the value.
  2. Register a pre_pipe_connect callback which increments num_connected_pipes if the socket is not closed.
  3. Register a post_pipe_remove callback which decrements num_connected_pipes by one, and also removes the socket from _live_sockets if num_connected_pipes is 0 and the socket is closed.
  4. Add logic to pynng.Socket.close to remove itself from _live_sockets if num_connected_pipes is 0, while holding the _pipe_notify_lock.

I think this will work, but I have not tested it yet, and I do not think I'll get to it tonight.
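A very rough, untested sketch of that bookkeeping (the names and the list-based entries here are illustrative, not the actual pynng internals):

import threading

# Illustrative stand-in for pynng's cache: sock_id -> [open_pipe_count, socket, closed]
_live_sockets = {}
_lock = threading.Lock()

def _maybe_forget(sock_id):
    # drop the socket once it is closed and its last pipe is gone
    entry = _live_sockets.get(sock_id)
    if entry is not None and entry[0] == 0 and entry[2]:
        del _live_sockets[sock_id]

def pre_pipe_connect(sock_id):
    with _lock:
        entry = _live_sockets.get(sock_id)
        if entry is not None and not entry[2]:
            entry[0] += 1

def post_pipe_remove(sock_id):
    with _lock:
        entry = _live_sockets.get(sock_id)
        if entry is not None and entry[0] > 0:
            entry[0] -= 1
        _maybe_forget(sock_id)

def socket_close(sock_id):
    # would be called from pynng.Socket.close while holding _pipe_notify_lock
    with _lock:
        entry = _live_sockets.get(sock_id)
        if entry is not None:
            entry[2] = True
        _maybe_forget(sock_id)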

By the way, here is a simple reproducing script:

import threading
import pynng

addr = 'inproc:///beepbeep'
sock = pynng.Pair0(listen=addr)
while True:
    # each iteration rebinds p0, so the previous dialing socket is garbage
    # collected almost immediately, racing its pipe callbacks
    p0 = pynng.Pair0(dial=addr, block_on_dial=False)

This also exposes a race around getting pipes from the socket objects, but that is a separate issue.

@chaoflow
Author

@codypiersall Are there legitimate cases where _nng_pipe_cb should run and do something meaningful despite the socket having already been garbage collected, or could we do something like:

def _nng_pipe_cb(lib_pipe, event, arg):
    sock_id = int(ffi.cast('size_t', arg))
    try:
        sock = _live_sockets[sock_id]
    except KeyError:
        logger.debug('Ignoring callback for garbage-collected socket')
        return
    with sock._pipe_notify_lock:

at https://github.com/codypiersall/pynng/blob/master/pynng/nng.py#L1213.

It would be nice, though, to be sure that it was garbage collected and is not missing for some other reason.

@codypiersall
Owner

Are there legitimate cases where _nng_pipe_cb should run and do something meaningful despite the socket being already garbage collected?

I can imagine legitimate use cases; whether they actually exist, I'm not sure. But, for example, someone may grab a resource in _pre_pipe_connect_cb and release it in post_pipe_remove_cb.
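For instance, something along these lines, assuming pynng's add_pre_pipe_connect_cb / add_post_pipe_remove_cb hooks and the Pipe.id property (io.BytesIO here is just a hypothetical stand-in for a real per-connection resource):

import io
import pynng

# hypothetical per-pipe resource tracking
resources = {}

def grab(pipe):
    # acquire a resource when a pipe is about to be added
    resources[pipe.id] = io.BytesIO()

def release(pipe):
    # release it after the pipe has been removed
    res = resources.pop(pipe.id, None)
    if res is not None:
        res.close()

sock = pynng.Pair0()
sock.add_pre_pipe_connect_cb(grab)
sock.add_post_pipe_remove_cb(release)
sock.listen('tcp://127.0.0.1:13134')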

@codypiersall
Owner

@chaoflow Which transport were you using when you saw these races? I'm seeing the GC race issue I was talking about only with the inproc transport. I'm going to do some spelunking in the nng library to understand what's going on under the hood.

@chaoflow
Author

chaoflow commented Jun 1, 2019

@codypiersall That was with inproc as well.

@codypiersall
Owner

After a little more investigation, it turns out that inproc was a red herring: in my ipc tests, the sockets were never actually establishing a connection, so I wasn't seeing a bug.

I've figured out that the root cause seems to be that pipe events can be called in an order I didn't expect. I expected that the events would strictly fire in the following order (I'm using the names from the nng lib):

  1. NNG_PIPE_EV_ADD_PRE
  2. NNG_PIPE_EV_ADD_POST
  3. NNG_PIPE_EV_REM_POST

However, I have observed that the sequence above is not always what happens in practice. The code I used to determine this is just thrown together at https://github.com/codypiersall/pynng/tree/pipe_evt_dbg.

I think this may be something that needs to be fixed in the nng lib. I'll look into it and see about posting a PR or an issue there.

@chaoflow
Author

chaoflow commented Jun 5, 2019

@codypiersall Thank you very much for the update and for looking into it!

@codypiersall
Owner

In a surprising turn of events, I was able to write a patch for nng that makes the callbacks behave so that NNG_PIPE_EV_REM_POST can never be delivered before NNG_PIPE_EV_ADD_POST. I'll open an issue in nng about getting this upstream in the next few days. My current implementation for nng is pretty hacky, so I need to clean it up and see if Garrett is interested in fixing this upstream.

@codypiersall
Owner

Soon I'll make a PR in nng with a branch I made that fixes this issue. I think I should be able to get that upstream. If my fix gets merged into nng, I'll close out this issue; otherwise, I'll add a step in the build process of pynng to patch nng with this fix. I think upstream will be amenable to the change in my branch though.

@bsdis

bsdis commented Aug 18, 2019

Are there any updates on this issue?

@codypiersall
Owner

I'm hoping to open an issue in nng, but haven't had time to do it lately. I'm confident we'll come to a solution eventually, but not sure about the timeline.

@dongdyang

Soon I'll make a PR in nng with a branch I made that fixes this issue. I think I should be able to get that upstream. If my fix gets merged into nng, I'll close out this issue; otherwise, I'll add a step in the build process of pynng to patch nng with this fix. I think upstream will be amenable to the change in my branch though.

After installing nng and pynng from the branch you mentioned, I found that the pynng package could not pass pytest. I am wondering whether the test suite passed successfully on your machine?

The following is what I got:

pytest
================================================================ test session starts =================================================================
platform darwin -- Python 3.7.3, pytest-5.1.1, py-1.8.0, pluggy-0.12.0
rootdir: /Users/xxx/Workspaces/pynng
collected 50 items

test/test_aio.py sssss                                                                                                                         [ 10%]
test/test_api.py .....ss...                                                                                                                    [ 30%]
test/test_msg.py ..ss..s                                                                                                                       [ 44%]
test/test_options.py .....F.                                                                                                                       [ 58%]
test/test_pipe.py ........s.                                                                                                                   [ 78%]
test/test_protocols.py ...........                                                                                                             [100%]

Thanks for the answer.

@codypiersall
Owner

Hi @dongdyang,

I believe that failing test is due to a change in the way nng formats the sockaddr. I got the same results.

I'm not convinced the branch I made in nng is actually the best way to fix the issue, however. At this point I'm not sure what the best way is.

@codypiersall
Owner

So, I finally made a C reproducer that shows that there is a race condition in nng: nanomsg/nng#961 (comment)

That said, the race, as far as I can tell, can only happen when a socket calls dial with block=False and closes the dialing socket very quickly afterward. To work around this issue, just don't close the socket very quickly. I think in all real applications you probably wouldn't want to close the socket quickly anyway, but this is still a bug.

This needs to be fixed in nng, and I'll keep poking around to see if I can come up with a good solution. The PR I made does fix the issue, but I don't know that it's the best way to do it.

@wtfuzz
Contributor

wtfuzz commented Jan 18, 2020

My changes in pull request #55 might address this issue. The reason for the callbacks arriving out of order is most likely thread scheduling, since the add and remove events originate from two different threads internally to nng.
