Prominently document that Mutex, Condition, ... might not behave as expected with Domainslib #127

Open
michael-schwarz opened this issue May 2, 2024 · 1 comment

Comments

@michael-schwarz

Thank you for this nice project, we found it quite helpful in our ongoing efforts to parallelize a fixpoint algorithm in OCaml.

A quick suggestion: It might be a good idea to prominently document that Mutex and Condition will not work out of the box as one might expect when combined with Domainslib. This will help people new to Multicore avoid going down a potentially time-consuming rabbit hole. (Apologies if such a remark already exists somewhere; I re-checked and still could not find one.)

Details (can be skipped by people familiar with the difference in behavior)

It took us quite a while to understand why our algorithm was not terminating and sometimes throwing exceptions, and we managed to extract this example:

open Domainslib

let main () =
  let mutex = Mutex.create () in
  let pool = Task.setup_pool ~num_domains:2 () in
  (* Each task repeatedly holds the mutex across a Task.await. *)
  let task () =
    for _i = 0 to 1000 do
      Mutex.lock mutex;
      let work = Task.async pool (fun () -> ()) in
      (* The await happens while the mutex is still held. *)
      Task.await pool work;
      Mutex.unlock mutex
    done
  in
  Task.run pool (fun () ->
    (* Four tasks contend for the same mutex on a pool with two extra domains. *)
    let p = Task.async pool task in
    let p1 = Task.async pool task in
    let p2 = Task.async pool task in
    let p3 = Task.async pool task in
    Task.await pool p;
    Task.await pool p1;
    Task.await pool p2;
    Task.await pool p3);
  ()

let _ = main ()

which will either crash with

michael@michael-XPS-13-9360:~/Documents/td-parallel$ _build/default/mutexproblem.exe 
Locking thread different from unlocking thread
Fatal error: exception Sys_error("Mutex.unlock: Operation not permitted")

or deadlock.

We ran into a similar problem when we tried to use a condition variable to wait until a certain number of tasks had reached a certain point: with n domains, it deadlocked as soon as n tasks had reached that point.

After looking into how Domainslib works, it of course becomes clear that one would have to use something akin to, e.g., https://github.com/ocaml-multicore/domain-local-await.
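For readers who want to see the crash from the example in isolation, here is a minimal sketch that uses only the OCaml 5 Stdlib and no Domainslib at all. It rests on an assumption about the cause: a task that calls Task.await while holding the mutex may be resumed on a different domain than the one that locked it, so the unlock comes from a non-owner. The function name foreign_unlock_error is made up for this illustration, and the exact error text may vary by platform.

```ocaml
(* Stdlib-only sketch (OCaml 5): unlock a mutex from a domain that does
   not own it. [foreign_unlock_error] is a hypothetical helper name. *)
let foreign_unlock_error () : string option =
  let m = Mutex.create () in
  (* The calling domain becomes the owner of the mutex. *)
  Mutex.lock m;
  (* A second domain tries to unlock a mutex it never locked. *)
  let other =
    Domain.spawn (fun () ->
      match Mutex.unlock m with
      | () -> None
      | exception Sys_error msg -> Some msg)
  in
  let result = Domain.join other in
  (* If the foreign unlock was rejected, the owner still holds the mutex
     and releases it here. *)
  if result <> None then Mutex.unlock m;
  result

let () =
  match foreign_unlock_error () with
  | Some msg -> Printf.printf "unlock from another domain failed: %s\n" msg
  | None -> print_endline "unlock from another domain did not raise"
```

The same ownership rule would also explain the deadlock case, assuming a worker domain picks up another task that tries to take the mutex while the first task is still suspended in its await.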

@polytypic
Contributor

Yes, this has been a known issue for a long time. See issue #126 here and remark here, for example.

You mentioned domain-local-await. Yes, that currently works with Domainslib and Eio. I'm currently working on Picos, which aims to provide a more comprehensive and more widely accepted solution to interoperability and replace domain-local-await and domain-local-timeout. Picos already provides replacements for the Stdlib Mutex and Condition. Unfortunately, no existing scheduler (aside from the sample schedulers in the Picos package) currently provides full compatibility with Picos. Hopefully we'll get a chance at some point to rewrite the internals of Domainslib to use Picos.
