
Fix multicast_observer deadlock #556

Open · wants to merge 3 commits into main

Conversation

mxgrey

@mxgrey mxgrey commented Aug 25, 2021

This PR fixes #555 and provides a regression test that minimally recreates the deadlocking issue.

If you check out Rx/v2/test/CMakeLists.txt and Rx/v2/test/subscriptions/race_condition.cpp into the current master branch and run rxcpp_test_race_condition, you will most likely find that the test deadlocks permanently. The deadlock is not guaranteed, since it depends on a race condition with a very narrow window, but it is very likely to occur (at least it has been on my machine). If it doesn't deadlock on the first try, the test can be rerun repeatedly until it eventually does.

But if you run the test on this branch, with the changes to multicast_observer, you will find that it never deadlocks no matter how many times the test is repeated. The fix is simply to change the mutex in multicast_observer::state_type to a std::recursive_mutex, which makes it safe to lock recursively; the recursive lock is what causes the deadlock with a plain std::mutex.
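
For illustration only (this is not code from the PR or from rxcpp itself), here is a minimal, self-contained model of the re-entrant locking pattern involved: a callback that runs while the state lock is held calls back into a method that takes the same lock. With std::mutex, the second lock on the same thread is undefined behavior and in practice hangs; with std::recursive_mutex, the same thread can re-acquire the lock and the program completes.

```cpp
#include <cstddef>
#include <functional>
#include <iostream>
#include <mutex>
#include <vector>

// Stand-in for multicast_observer's guarded state. The real class and member
// names in rxcpp differ; this only models the locking pattern.
struct state_model
{
    std::recursive_mutex lock;   // change this one line to std::mutex and run() hangs
    std::vector<std::function<void()>> callbacks;

    void add(std::function<void()> cb)
    {
        std::lock_guard<decltype(lock)> guard(lock);  // re-entered from run()
        callbacks.push_back(std::move(cb));
    }

    void run()
    {
        std::lock_guard<decltype(lock)> guard(lock);
        // Iterate by index: a callback may append, which would invalidate iterators.
        const std::size_t n = callbacks.size();
        for (std::size_t i = 0; i < n; ++i) {
            callbacks[i]();  // a callback that calls add() re-locks `lock` on this thread
        }
    }
};

int main()
{
    state_model state;
    state.add([&state] {
        // Re-entrant call while run() still holds the lock.
        state.add([] { std::cout << "added from inside a callback\n"; });
    });
    state.run();  // completes with std::recursive_mutex; hangs with std::mutex
    std::cout << "no deadlock\n";
}
```

The real multicast_observer state is of course more involved; the sketch only shows why swapping the mutex type removes the hang without restructuring the call chain.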

I understand that some people consider std::recursive_mutex something that should always be avoided, because needing it may indicate a design flaw in the memory access strategy. But after putting a lot of thought into this specific deadlock, I don't see an alternative fix that doesn't require a dramatic redesign of multicast_observer.

While this deadlock is a very rare occurrence, it can be an extremely impactful one: whenever it occurs, it permanently locks up an entire thread. In my own use case, I have a design pattern where I assign an entire category of operations to be observed on a specific rxcpp::schedulers::worker (essentially using the worker as a bottleneck to ensure mutually exclusive access to a certain cluster of data). If that worker ever gets caught in this deadlock, I permanently lose the ability to perform that entire category of operations.

So while it might be desirable to do a redesign that avoids the need for a std::recursive_mutex, I would strongly encourage using it as a stopgap until an alternative solution can be found and implemented, assuming that might take some time to work out.

@kirkshoop
Member

This is great work, thanks!
I think I found a simpler solution. Can you try it out?
master...kirkshoop:multicastdeadlock

@mxgrey
Author

mxgrey commented Aug 26, 2021

I considered unlocking the mutex before entering o.add(~). That would definitely resolve the risk of deadlock, but I wasn't sure if it could open up the risk of other race conditions.

For example, what if multicast_observer::on_completed() is called on another thread as soon as the mutex is unlocked, and s.unsubscribe() is triggered before o.add(~) has finished inserting the new subscription? Can we be sure that all unsubscribing callbacks will get triggered correctly?
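
As a purely hypothetical sketch (these names are invented and are not from rxcpp or from the linked branch), the alternative approach and the window it opens might look something like this:

```cpp
#include <functional>
#include <mutex>
#include <utility>

// Hypothetical stand-ins that only model the ordering question.
struct subscription
{
    std::function<void()> on_unsubscribe;
    void unsubscribe() { if (on_unsubscribe) on_unsubscribe(); }
};

struct observer_model
{
    subscription pending;
    void add(subscription s) { pending = std::move(s); }  // records the new subscription
};

struct state_model
{
    std::mutex lock;
    bool completed = false;
};

// Alternative approach: release the state lock before calling o.add(...).
void add_subscription(state_model& state, observer_model& o, subscription s)
{
    std::unique_lock<std::mutex> guard(state.lock);
    if (state.completed) {
        guard.unlock();
        s.unsubscribe();   // already terminated: tear the new subscription down now
        return;
    }
    guard.unlock();        // no re-entrant locking inside o.add(...)

    // Race window: another thread may mark the state completed and unsubscribe
    // the existing subscriptions right here, before o.add(...) has recorded `s`.
    // Correct cleanup then depends on o.add(...) noticing the terminated state
    // and unsubscribing `s` itself.
    o.add(std::move(s));
}
```

If o.add(~) can safely handle a target that has already been unsubscribed inside that window, then the approach is fine; otherwise the unsubscribe callbacks could be lost.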

I honestly don't know the answer to those questions, and I don't trust myself to have a good enough read on the whole add/unsubscribe process to be certain about whether or not it's safe. I think if we can't guarantee correct behavior when unlocking the mutex, it would be much better to avoid the risk by using the recursive mutex. But if it is possible to guarantee correct behavior after unlocking the mutex, then I have no objection to that approach.

@kirkshoop
Member

That is a good point, and I will think about it more.

I did run all my threading regression tests with no errors (these test cases are marked with the tag [perf]).

May close: Deadlock in multicast_observer (#555)