Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug in channel implementation using mutex and condition variables #25

Open
SumitPadhiyar opened this issue May 4, 2021 · 3 comments
Open

Comments

@SumitPadhiyar
Copy link

SumitPadhiyar commented May 4, 2021

There seem to be a bug in channel implementation using mutex+condition variable as implemented in PR #17 which causes task_exn test to deadlock or loop infintely. The bug manifest itself only in some runs otherwise it works fine. I was able to find this subtle bug using ParaFuzz.

Steps to reproduce

  1. Install ParaFuzz with its dependencies.
  2. Install domainslib based on PR Implement channels using Mutex and Condition Variables #17 & modified to work with parafuzz.
  3. Compile domainslib's test_exn.ml:
ocamlfind opt -linkpkg -package parafuzz-lib,domainslib task_exn.ml
  1. Execute binary few times to catch the bug.

Root cause

I had debugged the test program and suspect that this has to do with channel implementation using single mutex+condvar for all channels. Exact root cause is when two receivers are waiting for a message here and when sender sends a message here, it causes message to get lost. Receivers are waiting on same condition variable for the sender to write to a ref unique per waiting receiver. But the problem occurs when sender writes to ref r1(waited upon by receiver 1) and Condition_variable.signal signals wakes up receiver 2 which finds ref r2 not updated, so it waits again. So even after sending a message, both receiver are still waiting and the message is lost. The bug is mainly caused by the underlying assumption that waiting threads are woken up in the order in which they wait on the condition variable which is incorrect according to pthread_cond_signal documentation.

Fix

Fix is to use a unique mutex+condition_variable for each waiting receiver.

@ctk21
Copy link
Contributor

ctk21 commented May 4, 2021

Thanks for fuzzing domainslib, it is useful.

I'm a bit confused how your proposed fix will work. Isn't there a unique mutex+condition_variable for each waiting receiver in recv' already:

let mc = Domain.DLS.get mutex_condvar_key in

I can see there is the problem if domainslib is used with systhreads.

@ctk21
Copy link
Contributor

ctk21 commented May 4, 2021

With systhreads and continuations/fibers it is possible for multiple execution contexts (aka 'threads') to share the same Domain.DLS.get mutex_condvar_key since Domain.DLS is specific at the domain level.

I make fixing the problem with systhreads and continuations/fibers being to replace Condition.signal with Condition.broadcast.

@SumitPadhiyar
Copy link
Author

With systhreads and continuations/fibers it is possible for multiple execution contexts (aka 'threads') to share the same Domain.DLS.get mutex_condvar_key since Domain.DLS is specific at the domain level.

I make fixing the problem with systhreads and continuations/fibers being to replace Condition.signal with Condition.broadcast.

I agree on the fix of replacing Condition.signal with Condition.broadcast as waiting receivers will still check for ref Msg_slot to be written, thereby waking up a single waiting receiver.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants