
Fix multicast_observer deadlock #556

Open · wants to merge 3 commits into main

Conversation

mxgrey

@mxgrey mxgrey commented Aug 25, 2021

This PR fixes #555 and provides a regression test that minimally recreates the deadlocking issue.

If you check out Rx/v2/test/CMakeLists.txt and Rx/v2/test/subscriptions/race_condition.cpp into the current master branch and run rxcpp_test_race_condition, you will most likely find that the test deadlocks permanently. The deadlock is not guaranteed, since it depends on a race condition with a very narrow window, but it is very likely to occur (at least it has been on my machine). If it doesn't deadlock on the first try, the test can be rerun repeatedly until it eventually does.

But if you run the test on this branch, with the changes to multicast_observer, you will find that it never deadlocks no matter how many times the test is repeated. The fix is simply to change the mutex in multicast_observer::state_type to a std::recursive_mutex, which makes it safe to lock recursively; the recursive lock is what causes the deadlock with a plain std::mutex.
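
For illustration only (this is not code from the PR or from rxcpp itself), here is a minimal, self-contained model of the re-entrant locking pattern involved: a callback that runs while the state lock is held calls back into a method that takes the same lock. With std::mutex, the second lock on the same thread is undefined behavior and in practice hangs; with std::recursive_mutex, the same thread can re-acquire the lock and the program completes.

```cpp
#include <cstddef>
#include <functional>
#include <iostream>
#include <mutex>
#include <vector>

// Stand-in for multicast_observer's guarded state. The real class and member
// names in rxcpp differ; this only models the locking pattern.
struct state_model
{
    std::recursive_mutex lock;   // change this one line to std::mutex and run() hangs
    std::vector<std::function<void()>> callbacks;

    void add(std::function<void()> cb)
    {
        std::lock_guard<decltype(lock)> guard(lock);  // re-entered from run()
        callbacks.push_back(std::move(cb));
    }

    void run()
    {
        std::lock_guard<decltype(lock)> guard(lock);
        // Iterate by index: a callback may append, which would invalidate iterators.
        const std::size_t n = callbacks.size();
        for (std::size_t i = 0; i < n; ++i) {
            callbacks[i]();  // a callback that calls add() re-locks `lock` on this thread
        }
    }
};

int main()
{
    state_model state;
    state.add([&state] {
        // Re-entrant call while run() still holds the lock.
        state.add([] { std::cout << "added from inside a callback\n"; });
    });
    state.run();  // completes with std::recursive_mutex; hangs with std::mutex
    std::cout << "no deadlock\n";
}
```

The real multicast_observer state is of course more involved; the sketch only shows why swapping the mutex type removes the hang without restructuring the call chain.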

I understand that some people consider std::recursive_mutex something that should always be avoided, because needing it may indicate a design flaw in the memory access strategy. But after putting a lot of thought into this specific deadlock, I don't see an alternative fix that doesn't require a dramatic redesign of multicast_observer.

While this deadlock is a very rare occurrence, it can be an extremely impactful one: whenever it occurs, it permanently locks up an entire thread. In my own use case, I have a design pattern where I assign an entire category of operations to be observed on a specific rxcpp::schedulers::worker (essentially using the worker as a bottleneck to ensure mutually exclusive access to a certain cluster of data). If that worker ever gets caught in this deadlock, I permanently lose the ability to perform that entire category of operations.

So while it might be desirable to do a redesign that avoids the need for a std::recursive_mutex, I would strongly encourage using it as a stopgap until an alternative solution can be found and implemented, assuming that might take some time to work out.

@kirkshoop
Member

This is great work, thanks!
I think I found a simpler solution. Can you try it out?
master...kirkshoop:multicastdeadlock

@mxgrey
Author

mxgrey commented Aug 26, 2021

I considered unlocking the mutex before entering o.add(~). That would definitely resolve the risk of deadlock, but I wasn't sure if it could open up the risk of other race conditions.

For example, what if multicast_observer::on_completed() is called on another thread as soon as the mutex is unlocked, and s.unsubscribe() is triggered before o.add(~) has finished inserting the new subscription? Can we be sure that all unsubscribing callbacks will get triggered correctly?
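
As a purely hypothetical sketch (these names are invented and are not from rxcpp or from the linked branch), the alternative approach and the window it opens might look something like this:

```cpp
#include <functional>
#include <mutex>
#include <utility>

// Hypothetical stand-ins that only model the ordering question.
struct subscription
{
    std::function<void()> on_unsubscribe;
    void unsubscribe() { if (on_unsubscribe) on_unsubscribe(); }
};

struct observer_model
{
    subscription pending;
    void add(subscription s) { pending = std::move(s); }  // records the new subscription
};

struct state_model
{
    std::mutex lock;
    bool completed = false;
};

// Alternative approach: release the state lock before calling o.add(...).
void add_subscription(state_model& state, observer_model& o, subscription s)
{
    std::unique_lock<std::mutex> guard(state.lock);
    if (state.completed) {
        guard.unlock();
        s.unsubscribe();   // already terminated: tear the new subscription down now
        return;
    }
    guard.unlock();        // no re-entrant locking inside o.add(...)

    // Race window: another thread may mark the state completed and unsubscribe
    // the existing subscriptions right here, before o.add(...) has recorded `s`.
    // Correct cleanup then depends on o.add(...) noticing the terminated state
    // and unsubscribing `s` itself.
    o.add(std::move(s));
}
```

If o.add(~) can safely handle a target that has already been unsubscribed inside that window, then the approach is fine; otherwise the unsubscribe callbacks could be lost.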

I honestly don't know the answer to those questions, and I don't trust myself to have a good enough read on the whole add/unsubscribe process to be certain about whether or not it's safe. I think if we can't guarantee correct behavior when unlocking the mutex, it would be much better to avoid the risk by using the recursive mutex. But if it is possible to guarantee correct behavior after unlocking the mutex, then I have no objection to that approach.

@kirkshoop
Member

That is a good point, and I will think about it more.

I did run all my threading regression tests with no errors (these test cases are marked with the tag [perf]).

May close: Deadlock in multicast_observer (#555)