[ocaml5-issue] Deadlock in Dynlink test on Cygwin+MinGW+MSVC #307

Open
shym opened this issue Mar 9, 2023 · 8 comments
Labels
ocaml5-issue A potential issue in the OCaml5 compiler/runtime

Comments

@shym
Collaborator

shym commented Mar 9, 2023

Deadlock observed in a run on trunk Cygwin:
https://github.com/shym/multicoretests/actions/runs/4367430739/jobs/7638729550#step:21:764

Wed, 08 Mar 2023 22:46:19 GMT random seed: 366632243
Wed, 08 Mar 2023 22:46:19 GMT generated error fail pass / total     time test name
Wed, 08 Mar 2023 22:46:19 GMT
Wed, 08 Mar 2023 22:46:19 GMT [ ]    0    0    0    0 /  100     0.0s negative Lin DSL Dynlink test with Domain
Wed, 08 Mar 2023 22:47:55 GMT [ ]    0    0    0    0 /  100     0.0s negative Lin DSL Dynlink test with Domain (generating)
Wed, 08 Mar 2023 22:49:08 GMT [ ]    0    0    0    0 /  100    96.0s negative Lin DSL Dynlink test with Domain (shrinking:    1)
Thu, 09 Mar 2023 00:38:34 GMT [ ]    0    0    0    0 /  100   168.9s negative Lin DSL Dynlink test with Domain (shrinking:    3)
Thu, 09 Mar 2023 00:38:34 GMT Error: The operation was canceled.

As the code paths are completely different between Cygwin (which provides a dlopen()) and Windows (which uses flexdll), this is probably not related to #290.

@dra27

dra27 commented Mar 9, 2023

Ah - actually, we do still use flexdll on Cygwin, so I expect this is related.

@jmid jmid added the ocaml5-issue A potential issue in the OCaml5 compiler/runtime label Mar 28, 2023
@shym
Collaborator Author

shym commented Apr 5, 2023

Another point of interest, related but maybe involving something else: the Dynlink test on Cygwin can end abruptly (as with a segfault) while reporting no error ($? is 0). There might be another issue there, in the way some errors get dropped? 🤔

@jmid
Collaborator

jmid commented Sep 5, 2023

Seen again on Cygwin 5.1.0~rc2 when merging #389 into main
https://github.com/ocaml-multicore/multicoretests/actions/runs/6077015167/job/16485985404

random seed: 153981
generated error fail pass / total     time test name

[ ]    0    0    0    0 /  100     0.0s negative Lin DSL Dynlink test with Domain
[ ]    0    0    0    0 /  100     0.0s negative Lin DSL Dynlink test with Domain (generating)
Error: The operation was canceled.

@jmid
Collaborator

jmid commented Oct 12, 2023

This triggered again on the 0.3 branch for Cygwin trunk part1
https://github.com/ocaml-multicore/multicoretests/actions/runs/6481561492/job/17599326796

random seed: 406303381
generated error fail pass / total     time test name

[ ]    0    0    0    0 /  100     0.0s negative Lin DSL Dynlink test with Domain
[ ]    0    0    0    0 /  100     0.0s negative Lin DSL Dynlink test with Domain (generating)
Error: The operation was canceled.

@jmid
Collaborator

jmid commented Mar 15, 2024

I've spent some time creating a reproducer for this:

libB.ml:

let value = 34

repro.ml:

let loadfile f =
  try Dynlink.loadfile (Dynlink.adapt_filename f)
  with Dynlink.Error (Dynlink.Module_already_loaded _) -> ()

let dont_crash () =
  let wait = Atomic.make true in
  let dom1 = Domain.spawn (fun () ->
      while Atomic.get wait do Domain.cpu_relax () done;
      loadfile "libB.cmxs") in
  let dom2 = Domain.spawn (fun () ->
      Atomic.set wait false;
      loadfile "libB.cmxs") in
  let _ = Domain.join dom1 in
  let _ = Domain.join dom2 in
  ()

let _ =
  for i=1 to 1000 do
    Printf.printf "%i %!" i;
    dont_crash ()
  done

Makefile:

all:
	ocamlopt -g -shared libB.ml -o libB.cmxs
	ocamlopt -g -I +dynlink dynlink.cmxa repro.ml -o repro.exe

clean:
	rm -f libB.cmi libB.cmx libB.o libB.cmxs repro.cmi repro.cmx repro.o repro.exe

On MinGW (5.1.0, 5.1.1, 5.2.0~alpha1, trunk) this causes a range of different errors:

  • hangs
  • segfaults
  • early exits
  • various Dynlink.Errors (bad object, not an OCaml plugin, missing frametable for libB, ...)

On MinGW 5.0.0 the errors trigger more rarely (but can still occur).
On Cygwin I've observed similar behaviour (no segfaults though).
On Linux I've not been able to trigger the issue.

I've found this: ocaml/flexdll#120 which ticks the right boxes, as I believe flexdll is involved on both MinGW and Cygwin (according to David's remark above).
So there seems to be a flexdll issue remaining in addition to ocaml/flexdll#112 @shym 😬
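
One way to probe the hypothesis that the two concurrent loads race with each other would be to serialize them behind a single lock and rerun the reproducer. The following is only a sketch of such an experiment: `load_lock` and `serialized` are hypothetical names introduced here and are not part of Dynlink or FlexDLL, and the demo at the end uses a plain counter instead of a real plugin load so that it runs anywhere.

```ocaml
(* Hypothetical helper, not part of Dynlink: one global lock shared by
   all domains. *)
let load_lock = Mutex.create ()

(* Run [f x] while holding [load_lock]; the lock is released even if
   [f] raises. *)
let serialized (f : 'a -> 'b) (x : 'a) : 'b =
  Mutex.lock load_lock;
  Fun.protect ~finally:(fun () -> Mutex.unlock load_lock) (fun () -> f x)

(* Stand-in demo: two domains update a non-atomic counter, but only
   while holding the lock, so no increment is lost. *)
let () =
  let counter = ref 0 in
  let bump () = serialized (fun () -> incr counter) () in
  let work () = for _ = 1 to 1000 do bump () done in
  let d1 = Domain.spawn work and d2 = Domain.spawn work in
  Domain.join d1;
  Domain.join d2;
  assert (!counter = 2000)
```

In repro.ml this would mean calling `serialized loadfile "libB.cmxs"` in both domains. If the failures go away, that would be consistent with a race in the load path; if they persist, the concurrency of the two `loadfile` calls alone is probably not the whole story.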

@jmid jmid changed the title [ocaml5-issue] Deadlock in Dynlink test on Cygwin [ocaml5-issue] Deadlock in Dynlink test on Cygwin+MinGW Mar 26, 2024
@jmid
Collaborator

jmid commented Mar 26, 2024

The weekly 5.1.1 run triggered a Dynlink stress test crash on MinGW:
https://github.com/ocaml-multicore/multicoretests/actions/runs/8406318839/job/23020071322

random seed: 398767628
generated error fail pass / total     time test name

[ ]    0    0    0    0 / 1000     0.0s Lin Dynlink stress test with Domain
File "src/dynlink/dune", line 14, characters 7-16:
14 |  (name lin_tests)
            ^^^^^^^^^
(cd _build/default/src/dynlink && ./lin_tests.exe --verbose)
Command exited with code -1073741819.

@jmid
Collaborator

jmid commented Mar 26, 2024

FTR, while dusting off #399 for merging, I discovered that the parallel Dynlink issue also affects MSVC - because it also uses FlexDLL under the surface.

Here's an example MSVC trunk run (which I got running before bytecode):
https://github.com/ocaml-multicore/multicoretests/actions/runs/8438568844/job/23111051221?pr=399

random seed: 373847262
generated error fail pass / total     time test name

[ ]    0    0    0    0 / 1000     0.0s Lin Dynlink stress test with Domain
File "src/dynlink/dune", line 14, characters 7-16:
14 |  (name lin_tests)
            ^^^^^^^^^
(cd _build/default/src/dynlink && ./lin_tests.exe --verbose)
Command exited with code -1073741819.
[ ]    0    0    0    0 / 1000     0.0s Lin Dynlink stress test with Domain (generating)

@jmid jmid changed the title [ocaml5-issue] Deadlock in Dynlink test on Cygwin+MinGW [ocaml5-issue] Deadlock in Dynlink test on Cygwin+MinGW+MSVC Apr 7, 2024
3 participants